addFile first validates the scheme of the given path. For a path with no scheme, addFile converts it to a canonical form. For a path with the local scheme, addFile prints out the following WARN message to the logs and exits:
File with 'local' scheme is not supported to add to file server, since it is already available on every node.
For a path with any other scheme, addFile creates a Hadoop Path from the given path.

addFile validates the URL if the path is an HTTP, HTTPS or FTP URI.

addFile throws a SparkException with the following message if the path is a local directory and the application is not running in local mode:

addFile does not support local directories when not running local mode.
addFile throws a SparkException with the following message if the path is a directory and the recursive flag is not turned on:
Added file $hadoopPath is a directory and recursive is not turned on.
In the end, addFile adds the file to the addedFiles internal registry (with the current timestamp):
For new files, addFile prints out the following INFO message to the logs, fetches the file (to the root directory and without using the cache) and postEnvironmentUpdate.
Added file [path] at [key] with timestamp [timestamp]
For files that were already added, addFile prints out the following WARN message to the logs:
The path [path] has been added already. Overwriting of added paths is not supported in the current version.
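For example, a file added with addFile can be resolved on executors using SparkFiles.get (the path below is illustrative):

```scala
import org.apache.spark.SparkFiles

// /tmp/lookup.txt is an example path; addFile makes it available on every node.
sc.addFile("/tmp/lookup.txt")

// Tasks resolve their local copy by file name.
val localPaths = sc.parallelize(1 to 2).map(_ => SparkFiles.get("lookup.txt")).collect()
```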
Spark local is one of the available runtime environments in Apache Spark. It is the only available runtime with no need for a proper cluster manager (and hence many call it a pseudo-cluster, however such concept do exist in Spark and is a bit different).
Spark local is used for the following master URLs (as specified using <<../SparkConf.md#, SparkConf.setMaster>> method or <<../configuration-properties.md#spark.master, spark.master>> configuration property):
local (with exactly 1 CPU core)
local[n] (with exactly n CPU cores)
local[*] (with the total number of CPU cores that is the number of available CPU cores on the local machine)
local[n, m] (with exactly n CPU cores and m retries when a task fails)
local[*, m] (with the total number of CPU cores that is the number of available CPU cores on the local machine, and m retries when a task fails)
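For example, an application can request the local runtime with SparkConf.setMaster (the application name below is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Use as many worker threads as there are CPU cores on the local machine.
val conf = new SparkConf().setAppName("local-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// Equivalently, spark.master can be set to local, local[4], local[4, 2], etc.
```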
Internally, Spark local uses LocalSchedulerBackend as the <<../SchedulerBackend.md#, SchedulerBackend>> and executor:ExecutorBackend.md[ExecutorBackend].
.Architecture of Spark local
image::../diagrams/spark-local-architecture.png[align="center"]
In this non-distributed multi-threaded runtime environment, Spark spawns all the main execution components - the spark-driver.md[driver] and an executor:Executor.md[] - in the same single JVM.
The default parallelism is the number of threads as specified in the master URL. This is the only mode where a driver is used for execution (as it acts both as the driver and the only executor).
The local mode is very convenient for testing, debugging or demonstration purposes as it requires no earlier setup to launch Spark applications.
Welcome to The Internals of Spark Core online book! 🤙
I'm Jacek Laskowski, a Freelance Data Engineer specializing in Apache Spark (incl. Spark SQL and Spark Structured Streaming), Delta Lake, Databricks, and Apache Kafka (incl. Kafka Streams) with brief forays into a wider data engineering space (e.g., Trino, Dask and dbt, mostly during Warsaw Data Engineering meetups).
I'm very excited to have you here and hope you will enjoy exploring the internals of Spark Core as much as I have.
I write to discover what I know.

-- Flannery O'Connor
\"The Internals Of\" series
I'm also writing other online books in the \"The Internals Of\" series. Please visit \"The Internals Of\" Online Books home page.
Expect text and code snippets from a variety of public sources. Attribution follows.
Now, let's take a deep dive into Spark Core 🔥
```java
long spill(
    long size,
    MemoryConsumer trigger)
```
spill is part of the MemoryConsumer abstraction.
Only when the given MemoryConsumer is not this BytesToBytesMap and the destructive MapIterator is in use does spill request the destructive MapIterator to spill (the given size in bytes).

spill returns 0 when the trigger is this BytesToBytesMap or there is no destructiveIterator in use. Otherwise, spill returns how many bytes the destructiveIterator managed to release.
ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr. It uses SparkStatusTracker to poll the status of stages periodically and print out active stages with more than one task. It keeps overwriting itself on a single line, showing at most the first three concurrent stages at a time.
The progress includes the stage id, the number of completed, active, and total tasks.
TIP: ConsoleProgressBar may be useful when you ssh to workers and want to see the progress of active stages.
ConsoleProgressBar is created when SparkContext is created with spark.ui.showConsoleProgress enabled and the logging level of SparkContext.md[org.apache.spark.SparkContext] logger as WARN or higher (i.e. less messages are printed out and so there is a "space" for ConsoleProgressBar).
To print the progress nicely ConsoleProgressBar uses the COLUMNS environment variable to know the width of the terminal. If the variable is not set, it assumes 80 columns.
The progress bar prints out the status of stages that have run for at least 500 milliseconds, every spark-webui-properties.md#spark.ui.consoleProgress.update.interval[spark.ui.consoleProgress.update.interval] milliseconds.
NOTE: The initial delay of 500 milliseconds before ConsoleProgressBar shows the progress is not configurable.
See the progress bar in Spark shell with the following:
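The original snippet is not included in this extract; a minimal sketch that matches the callouts below (assuming the log4j 1.x API shipped with older Spark shells) could be:

```scala
import org.apache.log4j._

assert(sc.getConf.getBoolean("spark.ui.showConsoleProgress", true))          // <1>
Logger.getRootLogger.setLevel(Level.OFF)                                     // <2>
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)       // <3>
sc.parallelize(1 to 4, 4).map { n => Thread.sleep(500 + 200 * n); n }.count  // <4>
```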
<1> Make sure spark.ui.showConsoleProgress is true. It is by default.
<2> Disable (OFF) the root logger (that includes Spark's logger).
<3> Make sure org.apache.spark.SparkContext logger is at least WARN.
<4> Run a job with 4 tasks with 500ms initial sleep and 200ms sleep chunks to see the progress bar.
TIP: https://youtu.be/uEmcGo8rwek[Watch the short video] that shows ConsoleProgressBar in action.
You may want to use the following example to see the progress bar in full glory - all 3 concurrent stages in console (borrowed from https://github.com/apache/spark/pull/3029#issuecomment-63244719[a comment to [SPARK-4017] show progress bar in console #3029]):
```text
> ./bin/spark-shell
scala> val a = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)
scala> val b = sc.makeRDD(1 to 1000, 10000).map(x => (x, x)).reduceByKey(_ + _)
scala> a.union(b).count()
```
ConsoleProgressBar requires a SparkContext.md[SparkContext].
When being created, ConsoleProgressBar reads spark-webui-properties.md#spark.ui.consoleProgress.update.interval[spark.ui.consoleProgress.update.interval] configuration property to set up the update interval and COLUMNS environment variable for the terminal width (or assumes 80 columns).
ConsoleProgressBar starts the internal refresh progress timer that periodically refreshes and shows the progress.
NOTE: ConsoleProgressBar is created when SparkContext is created, spark.ui.showConsoleProgress configuration property is enabled, and the logging level of SparkContext.md[org.apache.spark.SparkContext] logger is WARN or higher (i.e. less messages are printed out and so there is a \"space\" for ConsoleProgressBar).
NOTE: Once created, ConsoleProgressBar is available internally as _progressBar.
FileCommitProtocol is an abstraction of file committers that can setup, commit or abort a Spark job or task (while writing out a pair RDD and partitions).
FileCommitProtocol is used for RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset transformations (that use SparkHadoopWriter utility to write a key-value RDD out).
FileCommitProtocol is created using FileCommitProtocol.instantiate utility.
instantiate tries to find a constructor method that takes three arguments (two of type String and one Boolean) for the given jobId, outputPath and dynamicPartitionOverwrite flag. If found, instantiate prints out the following DEBUG message to the logs:
Using (String, String, Boolean) constructor
In case of NoSuchMethodException, instantiate prints out the following DEBUG message to the logs:
Falling back to (String, String) constructor
instantiate tries to find a constructor method that takes two arguments (two of type String) for the given jobId and outputPath.
With two String arguments, instantiate requires that the given dynamicPartitionOverwrite flag is disabled (false) or throws an IllegalArgumentException:
requirement failed: Dynamic Partition Overwrite is enabled but the committer [className] does not have the appropriate constructor
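As an illustration, a minimal sketch of calling instantiate with the built-in HadoopMapReduceCommitProtocol (the job id and output path are made up; FileCommitProtocol is an internal and unstable API):

```scala
import org.apache.spark.internal.io.FileCommitProtocol

val committer = FileCommitProtocol.instantiate(
  "org.apache.spark.internal.io.HadoopMapReduceCommitProtocol", // committer class name
  "0",                                                          // jobId (illustrative)
  "/tmp/fcp-output",                                            // outputPath (illustrative)
  false)                                                        // dynamicPartitionOverwrite
```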
instantiate is used when:
HadoopMapRedWriteConfigUtil and HadoopMapReduceWriteConfigUtil are requested to create a HadoopMapReduceCommitProtocol committer
(Spark SQL) InsertIntoHadoopFsRelationCommand, InsertIntoHiveDirCommand, and InsertIntoHiveTable logical commands are executed
(Spark Structured Streaming) FileStreamSink is requested to write out a micro-batch data
HadoopWriteConfigUtil[K, V] is an abstraction of writer configurers for SparkHadoopWriter to write a key-value RDD (for RDD.saveAsNewAPIHadoopDataset and RDD.saveAsHadoopDataset operators).
HeartbeatReceiver is a ThreadSafeRpcEndpoint that is registered on the driver as HeartbeatReceiver.
HeartbeatReceiver receives Heartbeat messages from executors for accumulator updates (with task metrics and a Spark application's accumulators) and pass them along to TaskScheduler.
HeartbeatReceiver is registered immediately after a Spark application is started (i.e. when SparkContext is created).
HeartbeatReceiver is a SparkListener to get notified about new executors or executors that are no longer available.
ExecutorMetrics peaks (by stage and stage attempt IDs)
Posted when Executor informs that it is alive and reports task metrics.
When received, HeartbeatReceiver finds the executorId executor (in executorLastSeen internal registry).
When the executor is found, HeartbeatReceiver updates the time the heartbeat was received (in executorLastSeen internal registry).
HeartbeatReceiver uses the Clock to know the current time.
HeartbeatReceiver then submits an asynchronous task to notify TaskScheduler that the heartbeat was received from the executor (using TaskScheduler internal reference). HeartbeatReceiver posts a HeartbeatResponse back to the executor (with the response from TaskScheduler whether the executor has been registered already or not so it may eventually need to re-register).
If however the executor was not found (in executorLastSeen internal registry), i.e. the executor was not registered before, you should see the following DEBUG message in the logs and the response is to notify the executor to re-register.
Received heartbeat from unknown executor [executorId]
In a very rare case, when TaskScheduler is not yet assigned to HeartbeatReceiver, you should see the following WARN message in the logs and the response is to notify the executor to re-register.
Dropping [heartbeat] because TaskScheduler is not ready yet
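A rough sketch of the messages involved (the actual case classes are private[spark] and their fields may differ slightly across Spark versions):

```scala
import org.apache.spark.executor.ExecutorMetrics
import org.apache.spark.storage.BlockManagerId
import org.apache.spark.util.AccumulatorV2

// Sent by executors; carries accumulator updates and ExecutorMetrics peaks.
case class Heartbeat(
    executorId: String,
    accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],  // taskId -> accumulator updates
    blockManagerId: BlockManagerId,
    executorUpdates: Map[(Int, Int), ExecutorMetrics])      // (stageId, stageAttemptId) -> peaks

// Sent back by HeartbeatReceiver; tells the executor to re-register when unknown.
case class HeartbeatResponse(reregisterBlockManager: Boolean)
```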
== [[InterruptibleIterator]] InterruptibleIterator -- Iterator With Support For Task Cancellation
InterruptibleIterator is a custom Scala https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator[Iterator] that supports task cancellation, i.e. it stops producing elements once the task it runs in has been killed.
Quoting the official Scala https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator[Iterator] documentation:
Iterators are data structures that allow to iterate over a sequence of elements. They have a hasNext method for checking if there is a next element available, and a next method which returns the next element and discards it from the iterator.
InterruptibleIterator is created when:
RDD is requested to rdd:RDD.md#getOrCompute[get or compute a RDD partition]
CoGroupedRDD, rdd:HadoopRDD.md#compute[HadoopRDD], rdd:NewHadoopRDD.md#compute[NewHadoopRDD], rdd:ParallelCollectionRDD.md#compute[ParallelCollectionRDD] are requested to compute a partition
BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined key-value records for a reduce task]
PairRDDFunctions is requested to rdd:PairRDDFunctions.md#combineByKeyWithClassTag[combineByKeyWithClassTag]
Spark SQL's DataSourceRDD and JDBCRDD are requested to compute a partition
Spark SQL's RangeExec physical operator is requested to doExecute
PySpark's BasePythonRunner is requested to compute
[[creating-instance]] InterruptibleIterator takes the following when created:
[[context]] TaskContext
[[delegate]] Scala Iterator[T]
NOTE: InterruptibleIterator is a Developer API which is a lower-level, unstable API intended for Spark developers that may change or be removed in minor versions of Apache Spark.
NOTE: hasNext is part of ++https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@hasNext:Boolean++[Iterator Contract] to test whether this iterator can provide another element.
hasNext requests the <<context, TaskContext>> to kill the task if interrupted (that simply throws a TaskKilledException that in turn breaks the task execution) and then requests the <<delegate, delegate Iterator>> whether it has a next element.
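A simplified sketch of the idea (the real class relies on the private[spark] TaskContext.killTaskIfInterrupted; the public isInterrupted is used here instead, purely for illustration):

```scala
import org.apache.spark.{TaskContext, TaskKilledException}

class MyInterruptibleIterator[T](
    context: TaskContext,
    delegate: Iterator[T]) extends Iterator[T] {

  override def hasNext: Boolean = {
    // Break the task execution as soon as the task has been marked as killed.
    if (context.isInterrupted()) {
      throw new TaskKilledException
    }
    delegate.hasNext
  }

  override def next(): T = delegate.next()
}
```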
NOTE: next is part of ++https://www.scala-lang.org/api/2.11.x/index.html#scala.collection.Iterator@next():A++[Iterator Contract] to produce the next element of this iterator.
next simply requests the <<delegate, delegate Iterator>> for the next element.

== ListenerBus
ListenerBus is an abstraction of event buses that can notify listeners about scheduling events.
"},{"location":"ListenerBus/#contract","title":"Contract","text":""},{"location":"ListenerBus/#notifying-listener-about-event","title":"Notifying Listener about Event
From the scaladoc (it's a private[spark] class so no way to find it outside the code):
Authority that decides whether tasks can commit output to HDFS. Uses a "first committer wins" policy.
OutputCommitCoordinator is instantiated in both the drivers and executors. On executors, it is configured with a reference to the driver's OutputCommitCoordinatorEndpoint, so requests to commit output will be forwarded to the driver's OutputCommitCoordinator.
This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull requests) for an extensive design discussion.
OutputCommitCoordinator is registered as OutputCommitCoordinator (with OutputCommitCoordinatorEndpoint RPC Endpoint) in the RPC Environment on the driver (when SparkEnv utility is used to create the "base" SparkEnv). Executors have an RpcEndpointRef to the endpoint on the driver.
coordinatorRef is used to post an AskPermissionToCommitOutput (by executors) to the OutputCommitCoordinator (when canCommit).
coordinatorRef is used to stop the OutputCommitCoordinator on the driver (when stop).
SparkContext uses the spark.driver.resourcesFile configuration property to discover driver resources and prints out the following INFO message to the logs:
```text
==============================================================
Resources for [componentName]:
[resources]
==============================================================
```
SparkContext prints out the following INFO message to the logs (with the value of spark.app.name configuration property):
Submitted application: [appName]
","text":""},{"location":"SparkContext-creating-instance-internals/#spark-on-yarn-and-sparkyarnappid","title":"Spark on YARN and spark.yarn.app.id
For Spark on YARN in cluster deploy mode, SparkContext checks whether the spark.yarn.app.id configuration property is defined. A SparkException is thrown if it does not exist:
Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
With spark.logConf configuration property enabled, SparkContext prints out the following INFO message to the logs:
```text
Spark configuration:
[conf.toDebugString]
```
Note
SparkConf.toDebugString is used very early in the initialization process and other settings configured afterwards are not included. Use SparkContext.getConf.toDebugString once SparkContext is initialized.
If spark-history-server:EventLoggingListener.md[event logging] is enabled, i.e. EventLoggingListener.md#spark_eventLog_enabled[spark.eventLog.enabled] flag is true, the internal field _eventLogDir is set to the value of EventLoggingListener.md#spark_eventLog_dir[spark.eventLog.dir] setting or the default value /tmp/spark-events.
Also, if spark-history-server:EventLoggingListener.md#spark_eventLog_compress[spark.eventLog.compress] is enabled (it is not by default), the short name of the CompressionCodec is assigned to _eventLogCodec. The config key is spark.io.compression.codec (default: lz4).
SparkContext creates an in-memory store (with an optional AppStatusSource if enabled) and requests the LiveListenerBus to register the AppStatusListener with the status queue.
The AppStatusStore is available using the statusStore property of the SparkContext.
SparkContext determines the amount of memory to allocate to each executor. It is the value of executor:Executor.md#spark.executor.memory[spark.executor.memory] setting, or SparkContext.md#environment-variables[SPARK_EXECUTOR_MEMORY] environment variable (or currently-deprecated SPARK_MEM), or defaults to 1024.
_executorMemory is later available as sc.executorMemory and used for LOCAL_CLUSTER_REGEX, SparkDeploySchedulerBackend, to set executorEnvs("SPARK_EXECUTOR_MEMORY"), MesosSchedulerBackend, and CoarseMesosSchedulerBackend.
The value of SPARK_PREPEND_CLASSES environment variable is included in executorEnvs.
","text":""},{"location":"SparkContext-creating-instance-internals/#for-mesos-schedulerbackend-only","title":"For Mesos SchedulerBackend Only
The Mesos scheduler backend's configuration is included in executorEnvs, i.e. SparkContext.md#environment-variables[SPARK_EXECUTOR_MEMORY], _conf.getExecutorEnv, and SPARK_USER.
SparkContext creates a PluginContainer (with itself and the _resources).
","text":""},{"location":"SparkContext-creating-instance-internals/#creating-schedulerbackend-and-taskscheduler","title":"Creating SchedulerBackend and TaskScheduler
SparkContext object is requested to SparkContext.md#createTaskScheduler[create the SchedulerBackend with the TaskScheduler] (for the given master URL) and the result becomes the internal _schedulerBackend and _taskScheduler.
scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created] (as _dagScheduler).
","text":""},{"location":"SparkContext-creating-instance-internals/#setting-spark-applications-and-execution-attempts-ids","title":"Setting Spark Application's and Execution Attempt's IDs
SparkContext sets the internal fields -- _applicationId and _applicationAttemptId -- (using applicationId and applicationAttemptId methods from the scheduler:TaskScheduler.md#contract[TaskScheduler Contract]).
NOTE: SparkContext requests TaskScheduler for the scheduler:TaskScheduler.md#applicationId[unique identifier of a Spark application] (that is currently only implemented by scheduler:TaskSchedulerImpl.md#applicationId[TaskSchedulerImpl] that uses SchedulerBackend to scheduler:SchedulerBackend.md#applicationId[request the identifier]).
NOTE: The unique identifier of a Spark application is used to initialize spark-webui-SparkUI.md#setAppId[SparkUI] and storage:BlockManager.md#initialize[BlockManager].
NOTE: _applicationAttemptId is used when SparkContext is requested for the SparkContext.md#applicationAttemptId[unique identifier of execution attempt of a Spark application] and when EventLoggingListener spark-history-server:EventLoggingListener.md#creating-instance[is created].
","text":""},{"location":"SparkContext-creating-instance-internals/#setting-sparkappid-spark-property-in-sparkconf","title":"Setting spark.app.id Spark Property in SparkConf
SparkContext sets SparkConf.md#spark.app.id[spark.app.id] property to be the <<_applicationId, unique identifier of a Spark application>> and, if enabled, spark-webui-SparkUI.md#setAppId[passes it on to SparkUI].
SparkContext requests the MetricsSystem to start (with the value of the spark.metrics.staticSources.enabled configuration property).
Note
SparkContext starts the MetricsSystem after setting spark.app.id (see above) as MetricsSystem uses it to build unique identifiers for metrics sources.

=== Attaching Servlet Handlers to web UI
SparkContext requests the MetricsSystem for servlet handlers and requests the SparkUI to attach them.
With spark.cleaner.referenceTracking configuration property enabled, SparkContext creates a ContextCleaner (with itself and the _shuffleDriverComponents).
postEnvironmentUpdate is called that posts SparkListener.md#SparkListenerEnvironmentUpdate[SparkListenerEnvironmentUpdate] message on scheduler:LiveListenerBus.md[] with information about Task Scheduler's scheduling mode, added jar and file paths, and other environmental details.
SparkListener.md#SparkListenerApplicationStart[SparkListenerApplicationStart] message is posted to scheduler:LiveListenerBus.md[] (using the internal postApplicationStart method).
TaskScheduler scheduler:TaskScheduler.md#postStartHook[is notified that SparkContext is almost fully initialized].
NOTE: scheduler:TaskScheduler.md#postStartHook[TaskScheduler.postStartHook] does nothing by default, but custom implementations offer more advanced features, i.e. TaskSchedulerImpl scheduler:TaskSchedulerImpl.md#postStartHook[blocks the current thread until SchedulerBackend is ready]. There is also YarnClusterScheduler for Spark on YARN in cluster deploy mode.
NOTE: getClusterManager is used to find a cluster manager for a master URL when SparkContext.md#createTaskScheduler[creating a SchedulerBackend and a TaskScheduler for the driver].
setupAndStartListenerBus is an internal method that reads configuration-properties.md#spark.extraListeners[spark.extraListeners] configuration property from the current SparkConf.md[SparkConf] to create and register SparkListenerInterface listeners.
It expects that the class name represents a SparkListenerInterface listener with one of the following constructors (in this order):
a single-argument constructor that accepts SparkConf.md[SparkConf]
a zero-argument constructor
setupAndStartListenerBus scheduler:LiveListenerBus.md#ListenerBus-addListener[registers every listener class].
You should see the following INFO message in the logs:
INFO Registered listener [className]
It scheduler:LiveListenerBus.md#start[starts LiveListenerBus] and records it in the internal _listenerBusStarted.
When no single-SparkConf or zero-argument constructor could be found for a class name in configuration-properties.md#spark.extraListeners[spark.extraListeners] configuration property, a SparkException is thrown with the message:
[className] did not have a zero-argument constructor or a single-argument constructor that accepts SparkConf. Note: if the class is defined inside of another Scala class, then its constructors may accept an implicit parameter that references the enclosing class; in this case, you must define the listener as a top-level class in order to prevent this extra parameter from breaking Spark's ability to find a valid constructor.
Any exception while registering a SparkListenerInterface listener stops the SparkContext, and a SparkException is thrown with the following message (and the source exception as its cause):
Exception when registering SparkListener
Tip
Set INFO logging level for org.apache.spark.SparkContext logger to see the extra listeners being registered.
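A minimal sketch of an extra listener with the single-SparkConf-argument constructor (the class and package names are hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

// Must be a top-level class so setupAndStartListenerBus can find its constructor.
class MyExtraListener(conf: SparkConf) extends SparkListener {
  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    println(s"Application started: ${event.appName}")
  }
}

// Registered by fully-qualified class name.
val conf = new SparkConf()
  .set("spark.extraListeners", "org.example.MyExtraListener")
```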
SparkContext is the entry point to all of the components of Apache Spark (execution engine) and so the heart of a Spark application. In fact, you can consider an application a Spark application only when it uses a SparkContext (directly or indirectly).
Important
There should be one active SparkContext per JVM and Spark developers should use SparkContext.getOrCreate utility for sharing it (e.g. across threads).
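For example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Creates a SparkContext on the first call and returns the same active one afterwards.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("demo").setMaster("local[*]"))

val again = SparkContext.getOrCreate()
assert(sc == again)
```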
SparkContext uses an InheritableThreadLocal (Java) of key-value pairs of thread-local properties to pass extra information from a parent thread (on the driver) to child threads.
localProperties is meant to be used by developers using SparkContext.setLocalProperty and SparkContext.getLocalProperty.
Local Properties are available using TaskContext.getLocalProperty.
Local Properties are available to SparkListeners using the following events:
SparkListenerJobStart
SparkListenerStageSubmitted
localProperties are passed down when SparkContext is requested for the following:
Running Job (that in turn makes the local properties available to the DAGScheduler to run a job)
Running Approximate Job
Submitting Job
Submitting MapStage
DAGScheduler passes down local properties when scheduling:
ShuffleMapTasks
ResultTasks
TaskSets
Spark (Core) defines the following local properties.
| Name | Default Value | Setter |
|------|---------------|--------|
| callSite.long, callSite.short | | SparkContext.setCallSite |
| spark.job.description | callSite.short | SparkContext.setJobDescription (SparkContext.setJobGroup) |
| spark.job.interruptOnCancel | | SparkContext.setJobGroup |
| spark.jobGroup.id | | SparkContext.setJobGroup |
| spark.scheduler.pool | | |

=== ShuffleDriverComponents
SparkContext creates a ShuffleDriverComponents when created.
SparkContext loads the ShuffleDataIO that is in turn requested for the ShuffleDriverComponents. SparkContext requests the ShuffleDriverComponents to initialize.
The ShuffleDriverComponents is used when:
ShuffleDependency is created
SparkContext creates the ContextCleaner (if enabled)
SparkContext requests the ShuffleDriverComponents to clean up when stopping.
runJob is essentially executing a func function on all or a subset of partitions of an RDD and returning the result as an array (with elements being the results per partition).
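A minimal example (the partition count and the per-partition functions are illustrative):

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// One result per partition (4 sums here).
val sums: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.sum)

// Or only a subset of partitions.
val sizes: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.size, Seq(0, 1))
```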
maxNumConcurrentTasks requests the SchedulerBackend for the maximum number of tasks that can be launched concurrently (with the given ResourceProfile).
maxNumConcurrentTasks is used when:
DAGScheduler is requested to checkBarrierStageWithNumSlots
SparkEnv is a handle to Spark Execution Environment with the core services of Apache Spark (that interact with each other to establish a distributed computing platform for a Spark application).
There are two separate SparkEnvs of the driver and executors.
createDriverEnv creates a SparkEnv execution environment for the driver.
createDriverEnv accepts an instance of SparkConf, whether it runs in local mode or not, a scheduler:LiveListenerBus.md[], the number of cores to use for execution in local mode or 0 otherwise, and an OutputCommitCoordinator (default: none).
createDriverEnv ensures that spark-driver.md#spark_driver_host[spark.driver.host] and spark-driver.md#spark_driver_port[spark.driver.port] settings are defined.
It then passes the call straight on to the internal create helper method (with the driver executor id, isDriver enabled, and the input parameters).
createDriverEnv is used when SparkContext is created.
","text":"","tags":["DeveloperApi"]},{"location":"SparkEnv/#creating-sparkenv-for-executor","title":"Creating SparkEnv for Executor
createExecutorEnv creates an executor's (execution) environment that is the Spark execution environment for an executor.
createExecutorEnv simply creates the base SparkEnv (passing in all the input parameters).
NOTE: The number of cores numCores is configured using --cores command-line option of CoarseGrainedExecutorBackend and is specific to a cluster manager.
createExecutorEnv is used when CoarseGrainedExecutorBackend utility is requested to run.
create creates the "base" SparkEnv (that is common across the driver and executors).
create creates a RpcEnv as sparkDriver on the driver and sparkExecutor on executors.
create creates a Serializer (based on spark.serializer configuration property). create prints out the following DEBUG message to the logs:
Using serializer: [serializer]
create creates a SerializerManager.
create creates a JavaSerializer as the closure serializer.
create creates a BroadcastManager.

create creates a MapOutputTrackerMaster (on the driver) or a MapOutputTrackerWorker (on executors). create registers or looks up a MapOutputTrackerMasterEndpoint under the name of MapOutputTracker. create prints out the following INFO message to the logs (on the driver only):

Registering MapOutputTracker

create creates a ShuffleManager (based on spark.shuffle.manager configuration property).
create creates a UnifiedMemoryManager.
With spark.shuffle.service.enabled configuration property enabled, create creates an ExternalBlockStoreClient.
create creates a BlockManagerMaster.
create creates a NettyBlockTransferService.
create creates a BlockManager.
create creates a MetricsSystem.
create creates an OutputCommitCoordinator and registers or looks up an OutputCommitCoordinatorEndpoint under the name of OutputCommitCoordinator.
create creates a SparkEnv (with all the services "stitched" together).
write runs a Spark job to write out partition records (for all partitions of the given key-value RDD) with the given HadoopWriteConfigUtil and a HadoopMapReduceCommitProtocol committer.
The number of writer tasks (parallelism) is the number of the partitions in the given key-value RDD.
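For reference, a pair RDD written out through the new Hadoop API ends up on this write path; a minimal sketch (the output format and path are illustrative):

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// 4 partitions, so 4 write tasks.
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3), numSlices = 4)
  .map { case (k, v) => (new Text(k), new IntWritable(v)) }

pairs.saveAsNewAPIHadoopFile[TextOutputFormat[Text, IntWritable]]("/tmp/pairs-output")
```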
executeTask requests the given HadoopWriteConfigUtil to create a TaskAttemptContext.
executeTask requests the given FileCommitProtocol to set up a task with the TaskAttemptContext.
executeTask requests the given HadoopWriteConfigUtil to initWriter (with the TaskAttemptContext and the given sparkPartitionId).
executeTask initializes Hadoop output metrics (initHadoopOutputMetrics).
executeTask writes all rows of the RDD partition (from the given Iterator[(K, V)]). executeTask requests the given HadoopWriteConfigUtil to write. In the end, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to commit the task.
executeTask updates metrics about writing data to external systems (bytesWritten and recordsWritten) every few records and at the end.
In case of any errors, executeTask requests the given HadoopWriteConfigUtil to closeWriter and the given FileCommitProtocol to abort the task. In the end, executeTask prints out an ERROR message to the logs.
SparkListenerBus is an extension of the ListenerBus abstraction for event buses for SparkListenerInterfaces to be notified about SparkListenerEvents.
"},{"location":"SparkListenerBus/#posting-event-to-sparklistener","title":"Posting Event to SparkListener
SparkListenerInterface is an abstraction of event listeners (that SparkListenerBus notifies about scheduling events).
SparkListenerInterface is a way to intercept scheduling events from the Spark Scheduler that are emitted over the course of execution of a Spark application.
SparkListenerInterface is used heavily to manage communication between internal components in the distributed environment for a Spark application (e.g. web UI, event persistence for History Server, dynamic allocation of executors, keeping track of executors).
SparkListenerInterface can be registered in a Spark application using SparkContext.addSparkListener method or spark.extraListeners configuration property.
Tip
Enable INFO logging level for org.apache.spark.SparkContext logger to see what and when custom Spark listeners are registered.
org.apache.spark.scheduler.StatsReportListener (see https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.StatsReportListener[the listener's scaladoc]) is a SparkListener.md[] that logs summary statistics when each stage completes.
StatsReportListener listens to SparkListenerTaskEnd and SparkListenerStageCompleted events and prints them out at INFO logging level.
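For example, StatsReportListener can be registered declaratively or programmatically (assuming an active SparkContext sc):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.StatsReportListener

// Declaratively, before the SparkContext is created...
val conf = new SparkConf()
  .set("spark.extraListeners", "org.apache.spark.scheduler.StatsReportListener")

// ...or programmatically on an existing SparkContext.
sc.addSparkListener(new StatsReportListener)
```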
getDynamicAllocationInitialExecutors gives the maximum value of the following configuration properties (for the initial number of executors):
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.minExecutors
spark.executor.instances
getDynamicAllocationInitialExecutors prints out the following INFO message to the logs:
Using initial executors = [initialExecutors], max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
With spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:
spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
With spark.executor.instances less than spark.dynamicAllocation.minExecutors, getDynamicAllocationInitialExecutors prints out the following WARN message to the logs:
spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
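A rough sketch of the resolution described above (the real logic lives in Utils and also emits the warnings above; the default values used here are assumptions):

```scala
import org.apache.spark.SparkConf

def initialExecutors(conf: SparkConf): Int = Seq(
  conf.getInt("spark.dynamicAllocation.initialExecutors", 0),
  conf.getInt("spark.dynamicAllocation.minExecutors", 0),
  conf.getInt("spark.executor.instances", 0)
).max
```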
getDynamicAllocationInitialExecutors is used when:
ExecutorAllocationManager is created
SchedulerBackendUtils utility is used to getInitialTargetExecutorNumber
","text":""},{"location":"Utils/#local-directories-for-scratch-space","title":"Local Directories for Scratch Space
getConfiguredLocalDirs returns the local directories where Spark can write files to.
getConfiguredLocalDirs uses the given SparkConf to know if External Shuffle Service is enabled or not (based on spark.shuffle.service.enabled configuration property).
When in a YARN container (CONTAINER_ID), getConfiguredLocalDirs uses LOCAL_DIRS environment variable for YARN-approved local directories.
In non-YARN mode (or for the driver in yarn-client mode), getConfiguredLocalDirs checks the following environment variables (in order) and returns the value of the first found:
SPARK_EXECUTOR_DIRS
SPARK_LOCAL_DIRS
MESOS_DIRECTORY (only when External Shuffle Service is not used)
The environment variables are a comma-separated list of local directory paths.
In the end, when no earlier environment variables were found, getConfiguredLocalDirs uses spark.local.dir configuration property (with java.io.tmpdir System property as the default value).
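A simplified sketch of that resolution (ignoring the YARN-container and External Shuffle Service special cases):

```scala
import org.apache.spark.SparkConf

def configuredLocalDirs(conf: SparkConf): Array[String] = {
  val fromEnv = sys.env.get("SPARK_EXECUTOR_DIRS")
    .orElse(sys.env.get("SPARK_LOCAL_DIRS"))
    .orElse(sys.env.get("MESOS_DIRECTORY"))
  fromEnv
    .getOrElse(conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")))
    .split(",")  // the values are comma-separated lists of paths
}
```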
getConfiguredLocalDirs is used when:
DiskBlockManager is requested to createLocalDirs and createLocalDirsForMergedShuffleBlocks
Utils utility is used to get a single random local root directory and create a spark directory in every local root directory
","text":""},{"location":"Utils/#random-local-directory-path","title":"Random Local Directory Path
```scala
getLocalDir(
  conf: SparkConf): String
```
getLocalDir takes a random directory path out of the configured local root directories
getLocalDir throws an IOException if no local directory is defined:
Failed to get a temp directory under [[configuredLocalDirs]].
getLocalDir is used when:
SparkEnv utility is used to create a base SparkEnv for the driver
Utils utility is used to fetchFile
DriverLogger is created
RocksDBStateStoreProvider (Spark Structured Streaming) is requested for a RocksDB
PythonBroadcast (PySpark) is requested to readObject
AggregateInPandasExec (PySpark) is requested to doExecute
EvalPythonExec (PySpark) is requested to doExecute
WindowInPandasExec (PySpark) is requested to doExecute
PythonForeachWriter (PySpark) is requested for a UnsafeRowBuffer
Client (Spark on YARN) is requested to prepareLocalResources and createConfArchive
Worker (Spark Standalone) is requested to launch an executor
","text":""},{"location":"Utils/#creating-spark-directory-in-every-local-root-directory","title":"Creating spark Directory in Every Local Root Directory
getOrCreateLocalRootDirsImpl creates a spark-[randomUUID] directory under every root directory for local storage (and registers a shutdown hook to delete the directories at shutdown).
getOrCreateLocalRootDirsImpl prints out the following WARN message to the logs when any of the configured local root directories is specified as a URI (with a scheme):
The configured local directories are not expected to be URIs; however, got suspicious values [[uris]]. Please check your configured local directories.
","text":""},{"location":"Utils/#local-uri-scheme","title":"Local URI Scheme
Utils defines a local URI scheme for files that are locally available on worker nodes in the cluster.
getCurrentUserName computes the user name who has started the SparkContext.md[SparkContext] instance.
NOTE: It is later available as SparkContext.md#sparkUser[SparkContext.sparkUser].
Internally, it reads SparkContext.md#SPARK_USER[SPARK_USER] environment variable and, if not set, reverts to Hadoop Security API's UserGroupInformation.getCurrentUser().getShortUserName().
NOTE: It is another place where Spark relies on Hadoop API for its operation.
It starts by checking SPARK_LOCAL_HOSTNAME environment variable for the value. If it is not defined, it uses SPARK_LOCAL_IP to find the name (using InetAddress.getByName). If it is not defined either, it calls InetAddress.getLocalHost for the name.
NOTE: Utils.localHostName is executed while SparkContext.md#creating-instance[SparkContext is created] and also to compute the default value of spark-driver.md#spark_driver_host[spark.driver.host Spark property].
isPushBasedShuffleEnabled takes the value of spark.shuffle.push.enabled configuration property (from the given SparkConf).
If false, isPushBasedShuffleEnabled does nothing and returns false as well.
Otherwise, isPushBasedShuffleEnabled returns whether it is even possible to use push-based shuffle or not based on the following:
External Shuffle Service is used (based on spark.shuffle.service.enabled that should be true)
spark.master is yarn
(only with checkSerializer enabled) spark.serializer is a Serializer that supportsRelocationOfSerializedObjects
spark.io.encryption.enabled is false
In case spark.shuffle.push.enabled configuration property is enabled but the above requirements did not hold, isPushBasedShuffleEnabled prints out the following WARN message to the logs:
Push-based shuffle can only be enabled when the application is submitted to run in YARN mode, with external shuffle service enabled, IO encryption disabled, and relocation of serialized objects supported.
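The overall check can be sketched as follows (the serializer-relocation condition is omitted for brevity; the property defaults used here are assumptions):

```scala
import org.apache.spark.SparkConf

def pushBasedShuffleEnabled(conf: SparkConf): Boolean =
  conf.getBoolean("spark.shuffle.push.enabled", false) &&
    conf.getBoolean("spark.shuffle.service.enabled", false) &&
    conf.get("spark.master", "") == "yarn" &&
    !conf.getBoolean("spark.io.encryption.enabled", false)
```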
isPushBasedShuffleEnabled is used when:
ShuffleDependency is requested to canShuffleMergeBeEnabled
MapOutputTrackerMaster is created
MapOutputTrackerWorker is created
DAGScheduler is created
ShuffleBlockPusher utility is used to create a BLOCK_PUSHER_POOL thread pool
BlockManager is requested to initialize and registerWithExternalShuffleServer
BlockManagerMasterEndpoint is created
DiskBlockManager is requested to createLocalDirsForMergedShuffleBlocks
Spark uses a master/worker architecture. There is a spark-driver.md[driver] that talks to a single coordinator called spark-master.md[master] that manages spark-workers.md[workers] in which executor:Executor.md[executors] run.
The driver and the executors run in their own Java processes. You can run them all on the same (horizontal cluster) or separate machines (vertical cluster) or in a mixed machine configuration.
.Spark architecture in detail
image::sparkapp-sparkcontext-master-slaves.png[align="center"]
Number of times an Executor tries sending heartbeats to the driver before it gives up and exits (with exit code 56).
Default: 60
For example, with max failures 60 (the default) and spark.executor.heartbeatInterval 10s, then Executor will try to send heartbeats for up to 600s (10 minutes).
The files to be added to a Spark application (that can be defined directly as a configuration property or indirectly using --files option of spark-submit script)
A comma-separated list of directory paths for "scratch" space (a temporary storage for map output files, RDDs that get stored on disk, etc.). It is recommended to use paths on fast local disks in your system (e.g. SSDs).
How long to wait until an executor is available for locality-aware delay scheduling (for PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL TaskLocalities) unless locality-specific setting is set (i.e., spark.locality.wait.process, spark.locality.wait.node, and spark.locality.wait.rack, respectively)
Fraction of JVM heap space used for execution and storage.
Default: 0.6
The lower the more frequent spills and cached data eviction. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended.
Maximum memory (in bytes) for off-heap memory allocation
Default: 0
This setting has no impact on heap memory usage, so if your executors' total memory consumption must fit within some hard limit then be sure to shrink your JVM heap size accordingly.
Must not be negative and be set to a positive value when spark.memory.offHeap.enabled is enabled
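For example (the 1g value is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")
```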
Maximum number of remote blocks being fetched per reduce task from a given host port
When a large number of blocks are being requested from a given address in a single fetch or simultaneously, this could crash the serving executor or a Node Manager. This is especially useful to reduce the load on the Node Manager when external shuffle is enabled. You can mitigate the issue by setting it to a lower value.
Default: (unlimited)
Used when:
BlockStoreShuffleReader is requested to read combined records for a reduce task
Maximum number of remote requests to fetch blocks at any given point
When the number of hosts in the cluster increase, it might lead to very large number of inbound connections to one or more nodes, causing the workers to fail under load. By allowing it to limit the number of fetch requests, this scenario can be mitigated
Default: (unlimited)
Used when:
BlockStoreShuffleReader is requested to read combined records for a reduce task
Maximum size of all map outputs to fetch simultaneously from each reduce task (in MiB unless otherwise specified)
Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory
Default: 48m
Used when:
BlockStoreShuffleReader is requested to read combined records for a reduce task
Controls checksumming of shuffle data. If enabled, Spark will calculate the checksum values for each partition's data within the map output file and store the values in a checksum file on the disk. When shuffle data corruption is detected, Spark will try to diagnose the cause (e.g., network issue, disk issue, etc.) of the corruption by using the checksum file.
If enabled, part of a compressed/encrypted stream will be de-compressed/de-crypted by using extra memory to detect early corruption. Any IOException thrown will cause the task to be retried once and if it fails again with same exception, then FetchFailedException will be thrown to retry previous stage
Default: false
Used when:
BlockStoreShuffleReader is requested to read combined records for a reduce task
Size of the in-memory buffer for each shuffle file output stream, in KiB unless otherwise specified. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
Default: 32k
Must be greater than 0 and less than or equal to 2097151 ((Integer.MAX_VALUE - 15) / 1024)
(internal) Multi-thread is used when the number of mappers * shuffle partitions is greater than or equal to this threshold. Note that the actual parallelism is calculated by number of mappers * shuffle partitions / this threshold + 1, so this threshold should be positive.
Default: 10000000
Used when:
MapOutputTrackerMaster is requested for the statistics of a ShuffleDependency
(internal) Minimum number of partitions (threshold) for MapStatus utility to prefer a HighlyCompressedMapStatus (over CompressedMapStatus) (for ShuffleWriters).
Works in conjunction with the server side flag spark.shuffle.push.server.mergedShuffleFileManagerImpl which needs to be set with the appropriate org.apache.spark.network.shuffle.MergedShuffleFileManager implementation for push-based shuffle to be enabled
Used when:
Utils utility is used to determine whether push-based shuffle is enabled or not
If enabled (with spark.shuffle.useOldFetchProtocol disabled and spark.shuffle.service.enabled enabled), shuffle blocks requested from those block managers which are running on the same host are read from the disk directly instead of being fetched as remote blocks over the network.
(internal) The maximum number of elements in memory before forcing the shuffle sorter to spill.
Default: Integer.MAX_VALUE
The default value is to never force the sorter to spill, until Spark reaches some limitations, like the max page size limitation for the pointer array in the sorter.
Used when:
ShuffleExternalSorter is created
Spillable is created
Spark SQL's SortBasedAggregator is requested for an UnsafeKVExternalSorter
Spark SQL's ObjectAggregationMap is requested to dumpToExternalSorter
Spark SQL's UnsafeExternalRowSorter is created
Spark SQL's UnsafeFixedWidthAggregationMap is requested for an UnsafeKVExternalSorter
Controls whether DiskBlockObjectWriter should force outstanding writes to disk while committing a single atomic block (i.e. all operating system buffers should synchronize with the disk to ensure that all changes to a file are in fact recorded in the storage)
Default: false
Used when BlockManager is requested for a DiskBlockObjectWriter
Whether to use the old protocol while doing the shuffle block fetching. It is only enabled while we need the compatibility in the scenario of new Spark version job fetching shuffle blocks from old version external shuffle service.
The max number of executors for which the local dirs are stored. This size is applied both on the driver side and on the executors side to avoid having an unbounded store. This cache will be used to avoid the network in case of fetching disk persisted RDD blocks or shuffle blocks (when spark.shuffle.readHostLocalDisk is set) from the same host.
A Spark driver (aka an application's driver process) is a JVM process that hosts SparkContext.md[SparkContext] for a Spark application. It is the master node in a Spark application.
It is the cockpit of jobs and tasks execution (using scheduler:DAGScheduler.md[DAGScheduler] and scheduler:TaskScheduler.md[Task Scheduler]). It hosts spark-webui.md[Web UI] for the environment.
.Driver with the services
image::spark-driver.png[align="center"]
It splits a Spark application into tasks and schedules them to run on executors.
A driver is where the task scheduler lives and spawns tasks across workers.
A driver coordinates workers and overall execution of tasks.
NOTE: spark-shell.md[Spark shell] is a Spark application and the driver. It creates a SparkContext that is available as sc.
Driver requires the additional services (beside the common ones like shuffle:ShuffleManager.md[], memory:MemoryManager.md[], storage:BlockTransferService.md[], BroadcastManager):
Listener Bus
rpc:index.md[]
scheduler:MapOutputTrackerMaster.md[] with the name MapOutputTracker
storage:BlockManagerMaster.md[] with the name BlockManagerMaster
MetricsSystem with the name driver
OutputCommitCoordinator
CAUTION: FIXME Diagram of RpcEnv for a driver (and later executors). Perhaps it should be in the notes about RpcEnv?
High-level control flow of work
Your Spark application runs as long as the Spark driver. Once the driver terminates, so does your Spark application.
Creates SparkContext, RDD's, and executes transformations and actions
Launches scheduler:Task.md[tasks]
=== [[driver-memory]] Driver's Memory
It can be set first using spark-submit/index.md#command-line-options[spark-submit's --driver-memory] command-line option or <<spark_driver_memory, spark.driver.memory>> and falls back to spark-submit/index.md#environment-variables[SPARK_DRIVER_MEMORY] if not set earlier.
NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].
The driver's cores can be set first using spark-submit/index.md#driver-cores[spark-submit's --driver-cores] command-line option for cluster deploy mode.
NOTE: In client deploy mode the driver's memory corresponds to the memory of the JVM process the Spark application runs on.
NOTE: It is printed out to the standard error output in spark-submit/index.md#verbose-mode[spark-submit's verbose mode].
=== [[settings]] Settings
.Spark Properties

[[spark_driver_blockManager_port]] spark.driver.blockManager.port (default: storage:BlockManager.md#spark_blockManager_port[spark.blockManager.port]): Port to use for the storage:BlockManager.md[BlockManager] on the driver.

More precisely, spark.driver.blockManager.port is used when core:SparkEnv.md#NettyBlockTransferService[NettyBlockTransferService is created] (while SparkEnv is created for the driver).

[[spark_driver_memory]] spark.driver.memory (default: 1g): The driver's memory size (in MiBs).

Refer to <<driver-memory, Driver's Memory>>.

[[spark_driver_cores]] spark.driver.cores (default: 1): The number of CPU cores assigned to the driver in cluster deploy mode.

NOTE: When yarn/spark-yarn-client.md#creating-instance[Client is created] (for Spark on YARN in cluster mode only), it sets the number of cores for ApplicationManager using spark.driver.cores.
spark.driver.appUIAddress is used exclusively in yarn/README.md[Spark on YARN]. It is set when yarn/spark-yarn-client-yarnclientschedulerbackend.md#start[YarnClientSchedulerBackend starts] to yarn/spark-yarn-applicationmaster.md#runExecutorLauncher[run ExecutorLauncher] (and yarn/spark-yarn-applicationmaster.md#registerAM[register ApplicationMaster] for the Spark application).
spark.driver.extraClassPath system property sets the additional classpath entries (e.g. jars and directories) that should be added to the driver's classpath in cluster deploy mode.
For client deploy mode you can use a properties file or command line to set spark.driver.extraClassPath.
Do not use SparkConf.md[SparkConf] since it is too late for client deploy mode given the JVM has already been set up to start a Spark application.
"},{"location":"driver/#refer-to-spark-classmdbuildsparksubmitcommandbuildsparksubmitcommand-internal-method-for-the-very-low-level-details-of-how-it-is-handled-internally","title":"Refer to spark-class.md#buildSparkSubmitCommand[buildSparkSubmitCommand Internal Method] for the very low-level details of how it is handled internally.","text":"
spark.driver.extraClassPath uses an OS-specific path separator.
NOTE: Use spark-submit's spark-submit/index.md#driver-class-path[--driver-class-path command-line option] on command line to override spark.driver.extraClassPath from a spark-properties.md#spark-defaults-conf[Spark properties file].
SparkContext.setLocalProperty lets you set key-value pairs that will be propagated down to tasks and can be accessed there using TaskContext.getLocalProperty.
One of the purposes of local properties is to create logical groups of Spark jobs by means of properties that (regardless of the threads used to submit the jobs) make the separate jobs launched from different threads belong to a single logical group.
A common use case for the local property concept is to set a local property in a thread, say spark-scheduler-FairSchedulableBuilder.md[spark.scheduler.pool], after which all jobs submitted within the thread will be grouped, say into a pool by FAIR job scheduler.
```scala
val data = sc.parallelize(0 to 9)

sc.setLocalProperty("spark.scheduler.pool", "myPool")

// these two jobs (one per action) will run in the myPool pool
data.count
data.collect

sc.setLocalProperty("spark.scheduler.pool", null)

// this job will run in the default pool
data.count
```
Apache Spark is an open-source distributed general-purpose cluster computing framework with (mostly) in-memory data processing engine that can do ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (streaming processing) with rich concise high-level APIs for the programming languages: Scala, Python, Java, R, and SQL.
You could also describe Spark as a distributed, data processing engine for batch and streaming modes featuring SQL queries, graph processing, and machine learning.
In contrast to Hadoop's two-stage disk-based MapReduce computation engine, Spark's multi-stage (mostly) in-memory computing engine allows for running most computations in memory, and hence most of the time provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (read Spark officially sets a new record in large-scale sorting).
Spark aims at speed, ease of use, extensibility and interactive analytics.
Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms, and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Dataset.
Using Spark Application Frameworks, Spark simplifies access to machine learning and predictive analytics at scale.
Spark is mainly written in http://scala-lang.org/[Scala], but provides developer API for languages like Java, Python, and R.
If you have large amounts of data that requires low latency processing that a typical MapReduce program cannot provide, Spark is a viable alternative.
Access any data type across any data source.
Huge demand for storage and data processing.
The Apache Spark project is an umbrella for https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] (with Datasets), https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[streaming], http://spark.apache.org/mllib/[machine learning] (pipelines) and http://spark.apache.org/graphx/[graph] processing engines built on top of the Spark Core. You can run them all in a single application using a consistent API.
Spark runs locally as well as in clusters, on-premises or in cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).
Apache Spark's https://jaceklaskowski.gitbooks.io/spark-structured-streaming/[Structured Streaming] and https://jaceklaskowski.gitbooks.io/mastering-spark-sql/[SQL] programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.
At a high level, any Spark application creates RDDs out of some input, runs rdd:index.md[(lazy) transformations] of these RDDs into some other form (shape), and finally performs rdd:index.md[actions] to collect or store data. Not much, huh?
You can look at Spark from a programmer's, data engineer's and administrator's point of view. And to be honest, all three types of people will spend quite a lot of their time with Spark to finally reach the point where they exploit all the available features. Programmers use language-specific APIs (and work at the level of RDDs using transformations and actions), data engineers use higher-level abstractions like DataFrames or Pipelines APIs or external tools (that connect to Spark), and it is all only possible to run because administrators set up Spark clusters to deploy Spark applications to.
It is Spark's goal to be a general-purpose computing platform with various specialized applications frameworks on top of a single unified engine.
NOTE: When you hear \"Apache Spark\" it can be two things -- the Spark engine aka Spark Core or the Apache Spark open source project which is an \"umbrella\" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX] that sit on top of Spark Core and the main data abstraction in Spark called rdd:index.md[RDD - Resilient Distributed Dataset].
Let's list a few of the many reasons for Spark. We are doing it first, and then comes the overview that lends a more technical helping hand.
"},{"location":"overview/#easy-to-get-started","title":"Easy to Get Started","text":"
Spark offers spark-shell that makes for a very easy head start to writing and running Spark applications on the command line on your laptop.
You could then use Spark Standalone built-in cluster manager to deploy your Spark applications to a production-grade cluster to run on a full dataset.
"},{"location":"overview/#unified-engine-for-diverse-workloads","title":"Unified Engine for Diverse Workloads","text":"
As said by Matei Zaharia - the author of Apache Spark - in Introduction to AmpLab Spark Internals video (quoting with few changes):
One of the Spark project goals was to deliver a platform that supports a very wide array of diverse workflows - not only MapReduce batch jobs (that were already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning.
And also different scales of workloads from sub-second interactive jobs to jobs that run for many hours.
Spark combines batch, interactive, and streaming workloads under one rich concise API.
Spark supports near real-time streaming workloads via spark-streaming/spark-streaming.md[Spark Streaming] application framework.
ETL workloads and Analytics workloads are different, however Spark attempts to offer a unified platform for a wide variety of workloads.
Graph and Machine Learning algorithms are iterative by nature, and fewer saves to disk or transfers over the network mean better performance.
There is also support for interactive workloads using Spark shell.
You should watch the video https://youtu.be/SxAxAhn-BDU[What is Apache Spark?] by Mike Olson, Chief Strategy Officer and Co-Founder at Cloudera, who provides an exceptional overview of Apache Spark, its rise in popularity in the open source community, and how Spark is primed to replace MapReduce as the general processing engine in Hadoop.
=== Leverages the Best in distributed batch data processing
When you think about distributed batch data processing, varia/spark-hadoop.md[Hadoop] naturally comes to mind as a viable solution.
Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on YARN and HDFS - while improving on the performance and simplicity of the distributed computing engine.
For many, Spark is Hadoop++, i.e. MapReduce done in a better way.
And it should not come as a surprise: without Hadoop MapReduce (its advances and deficiencies), Spark would not have been born at all.
=== RDD - Distributed Parallel Scala Collections
As a Scala developer, you may find Spark's RDD API very similar (if not identical) to http://www.scala-lang.org/docu/files/collections-api/collections.html[Scala's Collections API].
It is also exposed in Java, Python and R (as well as SQL, i.e. SparkSQL, in a sense).
So, when you have a need for distributed Collections API in Scala, Spark with RDD API should be a serious contender.
=== [[rich-standard-library]] Rich Standard Library
Not only can you use map and reduce (as in Hadoop MapReduce jobs) in Spark, but also a vast array of other higher-level operators to ease your Spark queries and application development.
It expands the available computation styles beyond the sole map-and-reduce model available in Hadoop MapReduce.
=== Unified development and deployment environment for all
Regardless of the Spark tools you use - the Spark API for the many supported programming languages (Scala, Java, Python, R), spark-shell.md[the Spark shell], or the many Spark Application Frameworks leveraging the concept of rdd:index.md[RDD], i.e. Spark SQL, spark-streaming/spark-streaming.md[Spark Streaming], spark-mllib/spark-mllib.md[Spark MLlib] and spark-graphx.md[Spark GraphX] - you still use the same development and deployment environment to process large data sets and yield a result, be it a prediction (spark-mllib/spark-mllib.md[Spark MLlib]), a structured data query (Spark SQL) or just a large distributed batch (Spark Core) or streaming (Spark Streaming) computation.
It is also very productive that teams can exploit the different skills their members have acquired so far. Data analysts, data scientists, and Python, Java, Scala or R programmers can all use the same Spark platform through tailor-made APIs. It brings skilled people with expertise in different programming languages together on a Spark project.
Using spark-shell.md[the Spark shell] you can execute computations to process large amounts of data (The Big Data). It's all interactive and very useful to explore the data before a final production release.
Also, using the Spark shell you can access any spark-cluster.md[Spark cluster] as if it was your local machine. Just point the Spark shell to a 20-node cluster with 10TB of RAM in total (using --master) and use all the components (and their abstractions) like Spark SQL, Spark MLlib, spark-streaming/spark-streaming.md[Spark Streaming], and Spark GraphX.
Depending on your needs and skills, you may see a better fit for SQL vs programming APIs or apply machine learning algorithms (Spark MLlib) from data in graph data structures (Spark GraphX).
=== Single Environment
Regardless of which programming language you are good at, be it Scala, Java, Python, R or SQL, you can use the same single clustered runtime environment for prototyping, ad hoc queries, and deploying your applications leveraging the many ingestion data points offered by the Spark platform.
You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark SQL (Datasets), Spark MLlib (ML Pipelines), Spark GraphX (Graphs) or spark-streaming/spark-streaming.md[Spark Streaming] (DStreams).
Or use them all in a single application.
The single programming model and execution engine for different kinds of workloads simplify development and deployment architectures.
=== Data Integration Toolkit with Rich Set of Supported Data Sources
Spark can read from many types of data sources -- relational, NoSQL, file systems, etc. -- using many types of data formats - Parquet, Avro, CSV, JSON.
Both input and output data sources allow programmers and data engineers to use Spark as the platform that large amounts of data are read from or saved to for processing, interactively (using the Spark shell) or in applications.
=== Tools unavailable then, at your fingertips now
As much and often as it's recommended http://c2.com/cgi/wiki?PickTheRightToolForTheJob[to pick the right tool for the job], it's not always feasible. Time, personal preference, and the operating system you work on are all factors that decide what is right at a given time (and using a hammer can be a reasonable choice).
Spark embraces many concepts in a single unified development and runtime environment.
Machine learning that is so tool- and feature-rich in Python, e.g. the scikit-learn library, can now be used by Scala developers (as the Pipeline API in Spark MLlib or by calling pipe()).
DataFrames from R are available in Scala, Java, Python, R APIs.
Single node computations in machine learning algorithms are migrated to their distributed versions in Spark MLlib.
This single platform gives plenty of opportunities for Python, Scala, Java, and R programmers as well as data engineers (SparkR) and scientists (using proprietary enterprise data warehouses with spark-sql-thrift-server.md[Thrift JDBC/ODBC Server] in Spark SQL).
Mind the proverb https://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail[if all you have is a hammer, everything looks like a nail], too.
=== Low-level Optimizations
Apache Spark uses a scheduler:DAGScheduler.md[directed acyclic graph (DAG) of computation stages] (aka execution DAG). It postpones any processing until really required for actions. Spark's lazy evaluation gives plenty of opportunities to induce low-level optimizations (so users have to know less to do more).
Mind the proverb https://en.wiktionary.org/wiki/less_is_more[less is more].
=== Excels at low-latency iterative workloads
Spark supports diverse workloads, but successfully targets low-latency iterative ones. They are often used in Machine Learning and graph algorithms.
Many Machine Learning algorithms require plenty of iterations before the result models get optimal, like logistic regression. The same applies to graph algorithms to traverse all the nodes and edges when needed. Such computations can increase their performance when the interim partial results are stored in memory or at very fast solid state drives.
Spark can spark-rdd-caching.md[cache intermediate data in memory for faster model building and training]. Once the data is loaded to memory (as an initial step), reusing it multiple times incurs no performance slowdowns.
Also, graph algorithms can traverse graphs one connection per iteration with the partial result in memory.
Less disk access and network traffic can make a huge difference when you need to process lots of data, especially when it is Big Data.
=== ETL done easier
Spark gives Extract, Transform and Load (ETL) a new look with the many programming languages supported - Scala, Java, Python (less likely R). You can use them all or pick the best for a problem.
Scala in Spark, especially, makes for much less boilerplate code (compared to other languages and approaches like MapReduce in Java).
=== [[unified-api]] Unified Concise High-Level API
Spark offers unified, concise, high-level APIs for batch analytics (RDD API), SQL queries (Dataset API), real-time analysis (DStream API), machine learning (ML Pipeline API) and graph processing (Graph API).
Developers no longer have to learn many different processing engines and platforms, and can instead spend their time mastering framework APIs per use case (atop a single computation engine, Spark).
=== Different kinds of data processing using unified API
Spark offers three kinds of data processing - batch, interactive, and stream processing - with a unified API and data structures.
=== Little to no disk use for better performance
In the not-so-long-ago times, when the most prevalent distributed computing framework was varia/spark-hadoop.md[Hadoop MapReduce], you could reuse data between computations (even partial ones!) only after you had written it to an external storage like varia/spark-hadoop.md[Hadoop Distributed Filesystem (HDFS)]. It can cost you a lot of time to compute even very basic multi-stage computations. It simply suffers from IO (and perhaps network) overhead.
One of the many motivations to build Spark was to have a framework that is good at data reuse.
Spark cuts that out by keeping as much data as possible in memory and keeping it there until a job is finished. It doesn't matter how many stages belong to a job. What does matter is the available memory and how effective you are in using the Spark API (so rdd:index.md[no shuffle occurs]).
The less network and disk IO, the better performance, and Spark tries hard to find ways to minimize both.
=== Fault Tolerance included
Faults are not considered a special case in Spark, but an obvious consequence of being a parallel and distributed system. Spark handles and recovers from faults by default without particularly complex logic to deal with them.
=== Small Codebase Invites Contributors
Spark's design is fairly simple and the code that comes out of it is not huge compared to the features it offers.
The reasonably small codebase of Spark invites project contributors - programmers who extend the platform and fix bugs at a steadier pace.
== [[i-want-more]] Further reading or watching
(video) https://youtu.be/L029ZNBG7bk[Keynote: Spark 2.0 - Matei Zaharia, Apache Spark Creator and CTO of Databricks]
Push-Based Shuffle is a new feature of Apache Spark 3.2.0 (cf. SPARK-30602) to improve shuffle efficiency.
Push-based shuffle is enabled using spark.shuffle.push.enabled configuration property and can only be used in a Spark application submitted to YARN cluster manager, with external shuffle service enabled, IO encryption disabled, and relocation of serialized objects supported.
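As a minimal sketch of the application-side configuration (property names as mentioned above; the YARN cluster and external shuffle service setup on the server side are not shown):

```scala
import org.apache.spark.SparkConf

// a sketch only: assumes a YARN deployment with the external shuffle service
// already set up on the cluster side
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true") // external shuffle service
  .set("spark.shuffle.push.enabled", "true")    // push-based shuffle (Spark 3.2.0+)
```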
"},{"location":"spark-debugging/","title":"Debugging Spark","text":""},{"location":"spark-debugging/#using-spark-shell-and-intellij-idea","title":"Using spark-shell and IntelliJ IDEA","text":"
Start spark-shell with SPARK_SUBMIT_OPTS environment variable that configures the JVM's JDWP.
Use sbt -jvm-debug 5005, connect to the remote JVM at port 5005 using IntelliJ IDEA, and place breakpoints on the desired lines of the Spark source code.
```text
$ sbt -jvm-debug 5005
Listening for transport dt_socket at address: 5005
...
```
Run Spark context and the breakpoints get triggered.
```text
scala> val sc = new SparkContext(conf)
15/11/14 22:58:46 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
```
The valid logging levels are log4j's Levels (from most specific to least):
| Name | Description |
|------|-------------|
| OFF | No events will be logged |
| FATAL | A fatal event that will prevent the application from continuing |
| ERROR | An error in the application, possibly recoverable |
| WARN | An event that might possibly lead to an error |
| INFO | An event for informational purposes |
| DEBUG | A general debugging event |
| TRACE | A fine-grained debug message, typically capturing the flow through the application |
| ALL | All events should be logged |
The names of the logging levels are case-insensitive.
","text":""},{"location":"spark-logging/#turn-logging-off","title":"Turn Logging Off
The following sample conf/log4j2.properties turns all logging of Apache Spark (and Apache Hadoop) off.
```properties
# Set to debug or trace if log4j initialization fails
status = warn

# Name of the configuration
name = exploring-internals

# Console appender configuration
appender.console.type = Console
appender.console.name = consoleLogger
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{YYYY-MM-dd HH:mm:ss} [%t] %-5p %c:%L - %m%n
appender.console.target = SYSTEM_OUT

rootLogger.level = off
rootLogger.appenderRef.stdout.ref = consoleLogger

logger.spark.name = org.apache.spark
logger.spark.level = off

logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = off
```
","text":""},{"location":"spark-properties/","title":"Spark Properties and spark-defaults.conf Properties File","text":"
Spark properties are the means of tuning the execution environment of a Spark application.
The default Spark properties file is `$SPARK_HOME/conf/spark-defaults.conf` that can be overridden using spark-submit with the spark-submit/index.md#properties-file[--properties-file] command-line option.
spark-defaults.conf (under SPARK_CONF_DIR or $SPARK_HOME/conf) is the default properties file with the Spark properties of your Spark applications.
NOTE: spark-defaults.conf is loaded by spark-AbstractCommandBuilder.md#loadPropertiesFile[AbstractCommandBuilder's loadPropertiesFile internal method].
getDefaultPropertiesFile calculates the absolute path to the spark-defaults.conf properties file that can be either in the directory specified by the SPARK_CONF_DIR environment variable or the $SPARK_HOME/conf directory.
NOTE: getDefaultPropertiesFile is part of the private[spark] org.apache.spark.util.Utils object.
"},{"location":"spark-tips-and-tricks-access-private-members-spark-shell/","title":"Access private members in Scala in Spark shell","text":"
== Access private members in Scala in Spark shell
If you ever wanted to use private[spark] members in Spark using the Scala programming language, e.g. toy with org.apache.spark.scheduler.DAGScheduler or similar, you will have to use the following trick in Spark shell - use :paste -raw as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].
Open spark-shell and execute :paste -raw that allows you to enter any valid Scala code, including package.
The following snippet shows how to access private[spark] member DAGScheduler.RESUBMIT_TIMEOUT:
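A sketch of such a snippet follows (assuming DAGScheduler.RESUBMIT_TIMEOUT is available in your Spark version; the Peek object name is arbitrary). Type :paste -raw, paste the code, and press Ctrl+D:

```scala
// The package declaration is what requires the -raw mode:
// compiling inside org.apache.spark.scheduler gives this code access to
// private[spark] (and private[scheduler]) members.
package org.apache.spark.scheduler

object Peek {
  val resubmitTimeout = org.apache.spark.scheduler.DAGScheduler.RESUBMIT_TIMEOUT
}
```

Afterwards you can refer to org.apache.spark.scheduler.Peek.resubmitTimeout at the regular REPL prompt.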
"},{"location":"spark-tips-and-tricks-running-spark-windows/","title":"Running Spark Applications on Windows","text":"
== Running Spark Applications on Windows
Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.
NOTE: A Spark application could be spark-shell.md[spark-shell] or your own custom Spark application.
What makes the huge difference between the operating systems is Hadoop, which Spark uses internally for file system access.
You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.
NOTE: You do not have to install Apache Hadoop to work with Spark or run Spark applications.
TIP: Read the Apache Hadoop project's https://wiki.apache.org/hadoop/WindowsProblems[Problems running Hadoop on Windows].
Among the issues is the infamous java.io.IOException when running Spark Shell (below a stacktrace from Spark 2.0.2 on Windows 10 so the line numbers may be different in your case).
```text
16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
```
You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window (cmd) run as Administrator, i.e. using the Run as administrator option while starting cmd.
"},{"location":"spark-tips-and-tricks-running-spark-windows/#read-the-official-document-in-microsoft-technet-httpstechnetmicrosoftcomen-uslibrarycc947813vws10aspxstart-a-command-prompt-as-an-administrator","title":"Read the official document in Microsoft TechNet -- ++https://technet.microsoft.com/en-us/library/cc947813(v=ws.10).aspx++[Start a Command Prompt as an Administrator].","text":"
Download winutils.exe binary from https://github.com/steveloughran/winutils repository.
NOTE: You should select the version of Hadoop the Spark distribution was compiled with, e.g. use hadoop-2.7.1 for Spark 2 (https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe[here is the direct link to winutils.exe binary]).
Save winutils.exe binary to a directory of your choice, e.g. c:\hadoop\bin.
Set HADOOP_HOME to reflect the directory with winutils.exe (without bin).
```text
set HADOOP_HOME=c:\hadoop
```
Set PATH environment variable to include %HADOOP_HOME%\\bin as follows:
```text
set PATH=%HADOOP_HOME%\bin;%PATH%
```
TIP: Define HADOOP_HOME and PATH environment variables in Control Panel so any Windows program would use them.
c:\\tmp\\hive directory is the default value of https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir[hive.exec.scratchdir configuration property] in Hive 0.14.0 and later and Spark uses a custom build of Hive 1.2.1.
"},{"location":"spark-tips-and-tricks-running-spark-windows/#you-can-change-hiveexecscratchdir-configuration-property-to-another-directory-as-described-in-wzxhzdk27-configuration-property-in-this-document","title":"You can change hive.exec.scratchdir configuration property to another directory as described in <hive.exec.scratchdir Configuration Property>> in this document.
Execute the following command in cmd that you started using the option Run as administrator.
```text
winutils.exe chmod -R 777 C:\tmp\hive
```
Check the permissions (that is one of the commands that are executed under the covers):
```text
winutils.exe ls -F C:\tmp\hive
```
Open spark-shell and observe the output (perhaps with a few WARN messages that you can simply disregard).
As a verification step, execute the following line to display the content of a DataFrame:
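The exact line does not matter much; for example, a minimal one (any small DataFrame works) could be:

```scala
spark.range(1).show()
```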
```text
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of
the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered,
and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-
3.2.10.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar" is already
registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar" is
already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
```
If you see the above output, you're done. You should now be able to run Spark applications on your Windows. Congrats!
Create a hive-site.xml file with the following content:
```xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
```
Start a Spark application, e.g. spark-shell, with HADOOP_CONF_DIR environment variable set to the directory with hive-site.xml.
```text
HADOOP_CONF_DIR=conf ./bin/spark-shell
```
","text":""},{"location":"spark-tips-and-tricks-sparkexception-task-not-serializable/","title":"Task not serializable Exception","text":"
== org.apache.spark.SparkException: Task not serializable
When you run into org.apache.spark.SparkException: Task not serializable exception, it means that you use a reference to an instance of a non-serializable class inside a transformation. See the following example:
```text
➜  spark git:(master) ✗ ./bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.

scala> class NotSerializable(val num: Int)
defined class NotSerializable

scala> val notSerializable = new NotSerializable(10)
notSerializable: NotSerializable = NotSerializable@2700f556

scala> sc.parallelize(0 to 10).map(_ => notSerializable.num).count
org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
  at org.apache.spark.rdd.RDD.map(RDD.scala:317)
  ... 48 elided
Caused by: java.io.NotSerializableException: NotSerializable
Serialization stack:
  - object not serializable (class: NotSerializable, value: NotSerializable@2700f556)
  - field (class: $iw, name: notSerializable, type: class NotSerializable)
  - object (class $iw, $iw@10e542f3)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@729feae8)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@5fc3b20b)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@36dab184)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@5eb974)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@79c514e4)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@5aeaee3)
  - field (class: $iw, name: $iw, type: class $iw)
  - object (class $iw, $iw@2be9425f)
  - field (class: $line18.$read, name: $iw, type: class $iw)
  - object (class $line18.$read, $line18.$read@6311640d)
  - field (class: $iw, name: $line18$read, type: class $line18.$read)
  - object (class $iw, $iw@c9cd06e)
  - field (class: $iw, name: $outer, type: class $iw)
  - object (class $iw, $iw@6565691a)
  - field (class: $anonfun$1, name: $outer, type: class $iw)
  - object (class $anonfun$1, <function1>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
  ... 57 more
```
=== Further reading
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
"},{"location":"spark-tips-and-tricks/","title":"Spark Tips and Tricks","text":"
= Spark Tips and Tricks
== [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts
SPARK_PRINT_LAUNCH_COMMAND environment variable controls whether the Spark launch command is printed out to the standard error output, i.e. System.err, or not.
```text
Spark Command: [here comes the command]
========================================
```
All the Spark shell scripts use org.apache.spark.launcher.Main class internally that checks SPARK_PRINT_LAUNCH_COMMAND and when set (to any value) will print out the entire command line to launch it.
When you face networking issues, i.e. when Spark can't resolve your local hostname or IP address, use the SPARK_LOCAL_HOSTNAME environment variable to set a custom hostname or SPARK_LOCAL_IP to set a custom IP that is later resolved to a hostname.
Spark checks them out before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).
You may see the following WARN messages in the logs when Spark has finished the resolution process:
```text
Your hostname, [hostname] resolves to a loopback address: [host-address]; using...
Set SPARK_LOCAL_IP if you need to bind to another address
```
"},{"location":"spark-tips-and-tricks/#starting-standalone-master-and-workers-on-windows-7","title":"Starting standalone Master and workers on Windows 7","text":"
Windows 7 users can use spark-class to start Spark Standalone as there are no launch scripts for the Windows platform.
"},{"location":"speculative-execution-of-tasks/","title":"Speculative Execution of Tasks","text":"
Speculative tasks (also speculatable tasks or task strugglers) are tasks that run slower than most (FIXME the setting) of all the tasks in a job.
Speculative execution of tasks is a health-check procedure that checks for tasks to be speculated, i.e. running slower in a stage than the median of all successfully completed tasks in a taskset (FIXME the setting). Such slow tasks will be re-submitted to another worker. It will not stop the slow tasks, but run a new copy in parallel.
The thread starts as TaskSchedulerImpl starts in spark-cluster.md[clustered deployment modes] with configuration-properties.md#spark.speculation[spark.speculation] enabled. It executes periodically every configuration-properties.md#spark.speculation.interval[spark.speculation.interval] after the initial spark.speculation.interval passes.
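A minimal sketch of enabling it through the configuration properties referenced above (the interval value below is only illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")           // turn the speculation health-check on
  .set("spark.speculation.interval", "100ms") // how often to check for speculatable tasks
```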
When enabled, you should see the following INFO message in the logs:
It works as scheduler:TaskSchedulerImpl.md#task-scheduler-speculation[task-scheduler-speculation daemon thread pool] (using j.u.c.ScheduledThreadPoolExecutor with core pool size of 1).
The job with speculatable tasks should finish while speculative tasks are running, and it will leave these tasks running - no KILL command yet.
It uses checkSpeculatableTasks method that asks rootPool to check for speculatable tasks. If there are any, SchedulerBackend is called for scheduler:SchedulerBackend.md#reviveOffers[reviveOffers].
CAUTION: FIXME How does Spark handle repeated results of speculative tasks since there are copies launched?
Workers (aka slaves) are running Spark instances where executors live to execute tasks. They are the compute nodes in Spark.
CAUTION: FIXME Are workers perhaps part of Spark Standalone only?
CAUTION: FIXME How many executors are spawned per worker?
A worker receives serialized tasks that it runs in a thread pool.
It hosts a local storage:BlockManager.md[Block Manager] that serves blocks to other workers in a Spark cluster. Workers communicate among themselves using their Block Manager instances.
CAUTION: FIXME Diagram of a driver with workers as boxes.
Explain task execution in Spark and understand Spark's underlying execution model.
New vocabulary often faced in Spark UI
SparkContext.md[When you create SparkContext], each worker starts an executor. This is a separate process (JVM), and it loads your jar, too. The executors connect back to your driver program. Now the driver can send them commands, like flatMap, map and reduceByKey. When the driver quits, the executors shut down.
A new process is not started for each step. A new process is started on each worker when the SparkContext is constructed.
The executor deserializes the command (this is possible because it has loaded your jar), and executes it on a partition.
Shortly speaking, an application in Spark is executed in three steps:
Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire computation.
Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
Based on the plan, schedule and execute tasks on workers.
exercises/spark-examples-wordcount-spark-shell.md[In the WordCount example], the RDD graph is as follows:
file -> lines -> words -> per-word count -> global word count -> output
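A sketch of the code behind such an RDD graph could look as follows (the file names are placeholders):

```scala
val file = sc.textFile("README.md")        // file -> lines
val words = file.flatMap(_.split("\\s+"))  // lines -> words
val perWord = words.map(w => (w, 1))       // words -> per-word count
val counts = perWord.reduceByKey(_ + _)    // per-word count -> global word count (shuffle boundary)
counts.saveAsTextFile("counts")            // global word count -> output
```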
Based on this graph, two stages are created. The stage creation rule is based on the idea of pipelining as many rdd:index.md[narrow transformations] as possible. RDD operations with \"narrow\" dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage.
In the end, every stage will only have shuffle dependencies on other stages, and may compute multiple operations inside it.
In the WordCount example, the narrow transformation finishes at per-word count. Therefore, you get two stages:
file -> lines -> words -> per-word count
global word count -> output
Once stages are defined, Spark will generate scheduler:Task.md[tasks] from scheduler:Stage.md[stages]. The first stage will create scheduler:ShuffleMapTask.md[ShuffleMapTask]s with the last stage creating scheduler:ResultTask.md[ResultTask]s because in the last stage, one action operation is included to produce results.
The number of tasks to be generated depends on how your files are distributed. Suppose that you have three different files in three different nodes; the first stage will then generate 3 tasks: one task per partition.
Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is related to a partition.
The number of tasks being generated in each stage will be equal to the number of partitions.
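A quick way to observe this relationship in spark-shell (the numbers are only illustrative):

```scala
val rdd = sc.parallelize(1 to 9, numSlices = 3) // 3 partitions
rdd.getNumPartitions                            // Int = 3
rdd.count                                       // a single stage with 3 tasks (one per partition)
```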
Accumulators are shared variables that accumulate values from executors on the driver using associative and commutative \"add\" operation.
The main abstraction is AccumulatorV2.
Accumulators are registered (created) using SparkContext with or without a name. Only named accumulators are displayed in web UI.
DAGScheduler is responsible for updating accumulators (from partial values from tasks running on executors every heartbeat).
Accumulators are serializable so they can safely be referenced in the code executed in executors and then safely sent over the wire for execution.
```scala
// on the driver
val counter = sc.longAccumulator("counter")

sc.parallelize(1 to 9).foreach { x =>
  // on executors
  counter.add(x) }

// on the driver
println(counter.value)
```
AccumulatorContext is a private[spark] internal object used to track accumulators by Spark itself using an internal originals lookup table. Spark uses the AccumulatorContext object to register and unregister accumulators.
The originals lookup table maps accumulator identifier to the accumulator itself.
Every accumulator has its own unique accumulator id that is assigned using the internal nextId counter.
AccumulatorContext.SQL_ACCUM_IDENTIFIER is an internal identifier for Spark SQL's internal accumulators. The value is sql and Spark uses it to distinguish spark-sql-SparkPlan.md#SQLMetric[Spark SQL metrics] from others.
AccumulatorV2[IN, OUT] is an abstraction of accumulators
AccumulatorV2 is a Java Serializable.
"},{"location":"accumulators/AccumulatorV2/#contract","title":"Contract","text":""},{"location":"accumulators/AccumulatorV2/#adding-value","title":"Adding Value
```scala
add(
  v: IN): Unit
```
Accumulates (adds) the given v value to this accumulator
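A minimal sketch of a custom accumulator implementing this contract (a set-of-distinct-strings accumulator; the class name is arbitrary):

```scala
import java.{util => ju}
import org.apache.spark.util.AccumulatorV2

// IN = String, OUT = java.util.Set[String]
class DistinctStringsAccumulator extends AccumulatorV2[String, ju.Set[String]] {
  private val set = ju.Collections.synchronizedSet(new ju.HashSet[String]())

  override def isZero: Boolean = set.isEmpty
  override def copy(): DistinctStringsAccumulator = {
    val acc = new DistinctStringsAccumulator
    acc.set.addAll(set)
    acc
  }
  override def reset(): Unit = set.clear()
  override def add(v: String): Unit = set.add(v) // the add contract described above
  override def merge(other: AccumulatorV2[String, ju.Set[String]]): Unit =
    set.addAll(other.value)
  override def value: ju.Set[String] = set
}
```

Registering an instance with a name, e.g. sc.register(new DistinctStringsAccumulator, "distinct words"), makes it show up in the web UI.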
toInfo determines whether the accumulator is internal based on the name (and whether it uses the internal.metrics prefix) and uses it to create an AccumulableInfo.
toInfo is used when:
TaskRunner is requested to collectAccumulatorsAndResetStatusOnFailure
DAGScheduler is requested to updateAccumulators
TaskSchedulerImpl is requested to executorHeartbeatReceived
JsonProtocol is requested to taskEndReasonFromJson
SQLAppStatusListener (Spark SQL) is requested to handle a SparkListenerTaskEnd event (onTaskEnd)
readObject is part of the Serializable (Java) abstraction (for special handling during deserialization).
readObject reads the non-static and non-transient fields of the AccumulatorV2 from the given ObjectInputStream.
If the atDriverSide internal flag is turned on, readObject turns it off (to indicate readObject is executed on an executor). Otherwise, atDriverSide internal flag is turned on.
readObject requests the active TaskContext to register this accumulator.
Barrier Execution Mode (Barrier Scheduling) introduces a strong requirement on Spark Scheduler to launch all tasks of a Barrier Stage at the same time or not at all (and consequently wait until required resources are available). Moreover, a failure of a single task of a barrier stage fails the whole stage (and so the other tasks).
Barrier Execution Mode allows for as many tasks to be executed concurrently as ResourceProfile permits (that is enforced upon scheduling a barrier job).
Barrier Execution Mode aims at making Distributed Deep Learning with Apache Spark easier (or even possible).
Rephrasing dmlc/xgboost, Barrier Execution Mode makes sure that:
All tasks of a barrier stage are launched at once. If there are not enough task slots, an exception is produced
Tasks either all succeed or fail. Upon a task failure Spark aborts all the other tasks (TaskScheduler will kill all other running tasks) and restarts the whole barrier stage
Spark makes no assumption that tasks don't talk to each other. Actually, it is the opposite. Spark provides BarrierTaskContext which facilitates tasks discovery (e.g., barrier, allGather)
Permits restarting a training from a known state (checkpoint) in case of a failure
From the Design doc: Barrier Execution Mode:
In Spark, a task in a stage doesn't depend on any other task in the same stage, and hence it can be scheduled independently.
That gives Spark a freedom to schedule tasks in as many task batches as needed. So, 5 tasks can be scheduled on 1 CPU core quite easily in 5 consecutive batches. That's unlike MPI (or non-MapReduce scheduling systems) that allows for greater flexibility and inter-task dependency.
Later in Design doc: Barrier Execution Mode:
In MPI, all workers start at the same time and pass messages around.
To embed this workload in Spark, we need to introduce a new scheduling model, tentatively named \"barrier scheduling\", which launches the tasks at the same time and provides users enough information and tooling to embed distributed DL training into a Spark pipeline.
Barrier Execution Mode is based on RDD.barrier operator to indicate that Spark Scheduler must launch the tasks together for the current stage (and mark the current stage as a barrier stage).
```scala
barrier(): RDDBarrier[T]
```
RDD.barrier creates a RDDBarrier that comes with the barrier-aware mapPartitions transformation.
Under the covers, RDDBarrier.mapPartitions creates a MapPartitionsRDD like the regular RDD.mapPartitions transformation but with isFromBarrier flag enabled.
Task has a isBarrier flag that says whether this task belongs to a barrier stage (default: false).
Use RDD.mapPartitions transformation to access a BarrierTaskContext.
```scala
val barrierRdd = nums
  .barrier
  .mapPartitions { ns =>
    import org.apache.spark.{BarrierTaskContext, TaskContext}
    val ctx = TaskContext.get.asInstanceOf[BarrierTaskContext]
    val tid = ctx.partitionId()
    val port = 10000 + tid
    val host = "localhost"
    val message = s"A message from task $tid, e.g. $host:$port it listens at"
    val allTaskMessages = ctx.allGather(message)

    if (tid == 0) { // only Task 0 prints out status
      println(">>> Got host:port's from the other tasks")
      allTaskMessages.foreach(println)
    }

    if (tid == 0) { // only Task 0 prints out status
      println(">>> Starting a distributed training at the nodes...")
    }

    ctx.barrier() // this is BarrierTaskContext.barrier (not RDD.barrier)
                  // which can be confusing

    if (tid == 0) { // only Task 0 prints out status
      println(">>> All tasks have finished")
    }

    // return a model after combining (model) pieces from the nodes
    ns
  }
```
Run a distributed computation (using RDD.count action).
```scala
barrierRdd.count()
```
There should be INFO and TRACE messages printed out to the console (given ALL logging level for org.apache.spark.BarrierTaskContext logger).
```text
[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) has entered the global sync, current barrier epoch is 0.
...
[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] TRACE org.apache.spark.BarrierTaskContext:68 - Current callSite: CallSite($anonfun$runBarrier$2 at Logging.scala:68,org.apache.spark.BarrierTaskContext.$anonfun$runBarrier$2(BarrierTaskContext.scala:61)
...
[Executor task launch worker for task 1.0 in stage 5.0 (TID 13)] INFO org.apache.spark.BarrierTaskContext:60 - Task 13 from Stage 5(Attempt 0) finished global sync successfully, waited for 1 seconds, current barrier epoch is 1.
...
```
XGBoost4J is the JVM package of xgboost (an optimized distributed gradient boosting library with machine learning algorithms for regression and classification under the Gradient Boosting framework).
The heart of distributed training in xgboost4j-spark (that can run distributed xgboost on Apache Spark) is XGBoost.trainDistributed.
There's a familiar line that creates a barrier stage (using RDD.barrier()):
```scala
val boostersAndMetrics = trainingRDD.barrier().mapPartitions {
  // distributed training using XGBoost happens here
}
```
The barrier mapPartitions block is followed by RDD.collect() that gets the XGBoost4J-specific metadata (booster and metrics):
```scala
val (booster, metrics) = boostersAndMetrics.collect()(0)
```
Within the barrier stage (within mapPartitions block), xgboost4j-spark builds a distributed booster:
Checkpointing, when enabled, happens only by Task 0
All tasks initialize so-called collective Communicator for synchronization
xgboost4j-spark uses XGBoostJNI to talk to XGBoost using JNI
Only Task 0 returns non-empty iterator (and that's why the RDD.collect()(0) gets (booster, metrics))
All tasks execute SXGBoost.train that eventually leads to XGBoost.trainAndSaveCheckpoint
BarrierCoordinator is a ThreadSafeRpcEndpoint that is registered as barrierSync RPC Endpoint when TaskSchedulerImpl is requested to maybeInitBarrierCoordinator.
BarrierCoordinator is responsible for handling RequestToSync messages to coordinate Global Syncs of barrier tasks (using allGather and barrier operators).
In other words, the driver sets up a BarrierCoordinator (TaskSchedulerImpl precisely) upon startup that BarrierTaskContexts talk to using RequestToSync messages. BarrierCoordinator tracks the number of tasks to wait for until a barrier stage is complete and a response can be sent back to the tasks to continue (that are paused for 365 days (!)).
receiveAndReply is part of the RpcEndpoint abstraction.
receiveAndReply handles RequestToSync messages.
Unless already registered, receiveAndReply registers a new ContextBarrierId (for the stageId and the stageAttemptId) in the Barrier States registry.
Multiple Tasks and One BarrierCoordinator
receiveAndReply handles RequestToSync messages, one per task in a barrier stage. Out of all the properties of RequestToSync, numTasks, stageId and stageAttemptId are used.
The very first RequestToSync is used to register the stageId and stageAttemptId (as ContextBarrierId) with numTasks.
receiveAndReply finds the ContextBarrierState for the stage and the stage attempt (in the Barrier States registry) to handle the RequestToSync.
BarrierCoordinatorMessage is an abstraction of RPC messages that tasks can send out using BarrierTaskContext operators for BarrierCoordinator to handle.
BarrierCoordinatorMessage is a Serializable (Java) (so it can be sent from executors to the driver over the wire).
BarrierJobSlotsNumberCheckFailed is a BarrierJobAllocationFailed with the following exception message:
```text
[SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently.
Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
```
BarrierJobSlotsNumberCheckFailed can be thrown when DAGScheduler is requested to handle a JobSubmitted event.
runBarrier prints out the following INFO message to the logs:
```text
Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) has entered the global sync, current barrier epoch is [barrierEpoch].
```
runBarrier prints out the following TRACE message to the logs:
```text
Current callSite: [callSite]
```
runBarrier schedules a TimerTask (Java) to print out the following INFO message to the logs every minute:
```text
Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) waiting under the global sync since [startTime],
has been waiting for [duration] seconds,
current barrier epoch is [barrierEpoch].
```
runBarrier requests the Barrier Coordinator RPC Endpoint to send a RequestToSync one-off message and waits 365 days (!) for a response (a collection of responses from all the barrier tasks).
1 Year to Wait for Response from Barrier Coordinator
runBarrier uses 1 year to wait until the response arrives.
runBarrier checks every second if the response \"bundle\" arrived.
runBarrier increments the barrierEpoch.
runBarrier prints out the following INFO message to the logs:
```text
Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) finished global sync successfully,
waited for [duration] seconds,
current barrier epoch is [barrierEpoch].
```
In the end, runBarrier returns the response \"bundle\" (a collection of responses from all the barrier tasks).
In case of a SparkException, runBarrier prints out the following INFO message to the logs and reports (re-throws) the exception up (the call chain):
```text
Task [taskAttemptId] from Stage [stageId](Attempt [stageAttemptNumber]) failed to perform global sync,
waited for [duration] seconds,
current barrier epoch is [barrierEpoch].
```
runBarrier is used when:
BarrierTaskContext is requested to barrier, allGather
requesters is a registry of RpcCallContexts of the barrier tasks (of a barrier stage attempt) pending a reply.
It is only when the number of RpcCallContexts in the requesters reaches the number of tasks expected (while handling RequestToSync requests) that this ContextBarrierState is considered finished successfully.
ContextBarrierState initializes requesters when created to be of number of tasks size.
A new RpcCallContext of a barrier task is added in handleRequest only when the epoch of the barrier task matches the current barrierEpoch.
ContextBarrierState uses a TimerTask (Java) to ensure that a barrier() call can time out.
ContextBarrierState creates a TimerTask (Java) when requested to initTimerTask when requested to handle a RequestToSync message for the first global sync message received (when the requesters is empty). The TimerTask is then immediately scheduled to be executed after spark.barrier.sync.timeout.
spark.barrier.sync.timeout
Since spark.barrier.sync.timeout defaults to 365d (1 year), the TimerTask will run only after one year.
initTimerTask creates a new TimerTask (Java) that, when executed, sends a SparkException to all the requesters with the following message followed by cleanupBarrierStage for this ContextBarrierId.
```text
The coordinator didn't get all barrier sync requests
for barrier epoch [barrierEpoch] from [barrierId] within [timeoutInSecs] second(s).
```
The TimerTask is made available as timerTask.
initTimerTask is used when:
ContextBarrierState is requested to handle a RequestToSync message (for the first global sync message received when the requesters is empty)
handleRequest makes sure that the RequestMethod (of the given RequestToSync) is consistent across barrier tasks (using requestMethods registry).
handleRequest asserts that the number of tasks is this numTasks, and so consistent across barrier tasks. Otherwise, handleRequest reports IllegalArgumentException:
```text
Number of tasks of [barrierId] is [numTasks] from Task [taskId], previously it was [numTasks].
```
handleRequest prints out the following INFO message to the logs (with the ContextBarrierId and barrierEpoch):
```text
Current barrier epoch for [barrierId] is [barrierEpoch].
```
For the first sync message received (requesters is empty), handleRequest initializes the TimerTask and schedules it for execution after the timeoutInSecs.
Timeout
Starting the timerTask ensures that a sync may eventually time out (after a configured delay).
handleRequest registers the given requester in the requesters.
handleRequest registers the message of the RequestToSync in the messages for the partitionId.
handleRequest prints out the following INFO message to the logs:
```text
Barrier sync epoch [barrierEpoch] from [barrierId] received update from Task taskId,
current progress: [requesters]/[numTasks].
```
"},{"location":"barrier-execution-mode/ContextBarrierState/#updates-from-all-barrier-tasks-received","title":"Updates from All Barrier Tasks Received","text":"
When the barrier sync received updates from all barrier tasks (i.e., the number of requesters is the numTasks), handleRequest replies back to all the requesters with the messages.
handleRequest prints out the following INFO message to the logs:
```text
Barrier sync epoch [barrierEpoch] from [barrierId] received all updates from tasks,
finished successfully.
```
handleRequest increments the barrierEpoch, clears the requesters and the requestMethods, and then cancelTimerTask.
In case of the epoch of the given RequestToSync being different from this barrierEpoch, handleRequest sends back a failure message (with a SparkException) to the given requester:
```text
The request to sync of [barrierId] with barrier epoch [barrierEpoch] has already finished.
Maybe task [taskId] is not properly killed.
```
In case of different RequestMethods (in requestMethods registry), handleRequest sends back a failure message to the requesters (incl. the given requester):
```text
Different barrier sync types found for the sync [barrierId]: [requestMethods].
Please use the same barrier sync type within a single sync.
```
In the end, handleRequest clears the internal state (clear).
handleRequest is used when:
BarrierCoordinator is requested to handle a RequestToSync message
RequestMethod represents the allowed request methods of RequestToSyncs (that are sent out from barrier tasks using BarrierTaskContext).
ContextBarrierState tracks RequestMethods (from tasks inside a barrier sync) to make sure that the tasks are all part of a legitimate barrier sync. All tasks should make sure that they're calling the same method within the same barrier sync phase.
From the official documentation about Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
And later in the document:
Explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
Spark uses SparkContext to create broadcast variables and BroadcastManager with ContextCleaner to manage their lifecycle.
Not only can Spark developers use broadcast variables for efficient data distribution, but Spark itself uses them quite often too. A very notable use case is when Spark distributes tasks (to executors) for execution.
The idea is to transfer values used in transformations from a driver to executors in a most effective way so they are copied once and used many times by tasks (rather than being copied every time a task is launched).
"},{"location":"broadcast-variables/#lifecycle-of-broadcast-variable","title":"Lifecycle of Broadcast Variable
Broadcast variables (TorrentBroadcasts, actually) are created using SparkContext.broadcast method.
```text
scala> val b = sc.broadcast(1)
b: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)
```
Tip
Enable DEBUG logging level for org.apache.spark.storage.BlockManager logger to debug broadcast method.
With DEBUG logging level enabled, there should be the following messages printed out to the logs:
```text
Put block broadcast_0 locally took 430 ms
Putting block broadcast_0 without replication took 431 ms
Told master about block broadcast_0_piece0
Put block broadcast_0_piece0 locally took 4 ms
Putting block broadcast_0_piece0 without replication took 4 ms
```
A broadcast variable is stored on the driver's BlockManager as a single value and separately as chunks (of spark.broadcast.blockSize).
When requested for the broadcast value, TorrentBroadcast reads the broadcast block from the local BroadcastManager and, if that fails, from the local BlockManager. Only when the local lookups fail does TorrentBroadcast read the broadcast block chunks (from the BlockManagers on the other executors), persist them as a single broadcast variable (in the local BlockManager) and cache them in BroadcastManager.
```text
scala> b.value
res0: Int = 1
```
Broadcast.value is the only way to access the value of a broadcast variable in a Spark transformation. You can only access the broadcast value any time until the broadcast variable is destroyed.
With DEBUG logging level enabled, there should be the following messages printed out to the logs:
```text
Getting local block broadcast_0
Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
```
In the end, broadcast variables should be destroyed to release memory.
```scala
b.destroy
```
With DEBUG logging level enabled, there should be the following messages printed out to the logs:
You can use a broadcast variable to implement a map-side join, i.e. a join using a map. For this, lookup tables are distributed across nodes in a cluster using broadcast and then looked up inside map (to do the join implicitly).
When you broadcast a value, it is copied to executors only once (while it is copied multiple times for tasks otherwise). It means that broadcast can help to get your Spark application faster if you have a large value to use in tasks or there are more tasks than executors.
It appears that a Spark idiom is emerging in which broadcast is used with collectAsMap to create a Map for broadcast. When an RDD is mapped to a smaller dataset (column-wise, not record-wise), collected as a Map (collectAsMap) and broadcast, using the very big RDD to map its elements to the broadcast lookup tables is computationally faster.
```scala
val acMap = sc.broadcast(myRDD.map { case (a, b, c, d) => (a, c) }.collectAsMap)
val otherMap = sc.broadcast(myOtherRDD.collectAsMap)

myBigRDD.map { case (a, b, c, d) =>
  (acMap.value.get(a).get, otherMap.value.get(c).get)
}.collect
```
Use large broadcasted HashMaps over RDDs whenever possible and leave RDDs with a key to lookup necessary data as demonstrated above.
You're going to use a static mapping of interesting projects with their websites, i.e. Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations, use.
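A sketch of what that could look like (the project names and the pws value are made up):

```scala
val pws = Map(
  "Apache Spark" -> "http://spark.apache.org/",
  "Scala" -> "http://www.scala-lang.org/")

// pws is captured by the closure and shipped with every task
val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
```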
It works, but is very ineffective as the pws map is sent over the wire to executors while it could have been there already. If there were more tasks that need the pws map, you could improve their performance by minimizing the number of bytes that are going to be sent over the network for task execution.
Semantically, the two computations - with and without the broadcast value - are exactly the same, but the broadcast-based one wins performance-wise when there are more executors spawned to execute many tasks that use pws map.
","text":""},{"location":"broadcast-variables/#further-reading-or-watching","title":"Further Reading or Watching
newBroadcast requests the BroadcastFactory for a new broadcast variable (with the next available broadcast ID).
newBroadcast is used when:
SparkContext is requested for a new broadcast variable
MapOutputTracker utility is used to serializeMapStatuses
","text":""},{"location":"broadcast-variables/BroadcastManager/#unique-identifiers-of-broadcast-variables","title":"Unique Identifiers of Broadcast Variables
BroadcastManager tracks broadcast variables and assigns unique and continuous identifiers.
readBroadcastBlock looks up the BroadcastBlockId in (the cache of) BroadcastManager and returns the value if found.
Otherwise, readBroadcastBlock setConf and requests the BlockManager for the locally-stored broadcast data.
If the broadcast block is found locally, readBroadcastBlock requests the BroadcastManager to cache it and returns the value.
If not found locally, readBroadcastBlock multiplies the numBlocks by the blockSize for an estimated size of the broadcast block. readBroadcastBlock prints out the following INFO message to the logs:
Started reading broadcast variable [id] with [numBlocks] pieces (estimated total size [estimatedTotalSize])
readBroadcastBlock readBlocks and prints out the following INFO message to the logs:
Reading broadcast variable [id] took [time] ms
readBroadcastBlock unblockifies the block chunks into an object (using the Serializer and the CompressionCodec).
readBroadcastBlock requests the BlockManager to store the merged copy (so other tasks on this executor don't need to re-fetch it). readBroadcastBlock uses MEMORY_AND_DISK storage level and the tellMaster flag off.
readBroadcastBlock requests the BroadcastManager to cache it and returns the value.
","text":""},{"location":"broadcast-variables/TorrentBroadcast/#unblockifying-broadcast-value","title":"Unblockifying Broadcast Value
readBlocks creates a collection of BlockDatas for numBlocks block chunks.
For every block (randomly-chosen by block ID between 0 and numBlocks), readBlocks creates a BroadcastBlockId for the id (of the broadcast variable) and the chunk (identified by the piece prefix followed by the ID).
readBlocks prints out the following DEBUG message to the logs:
Reading piece [pieceId] of [broadcastId]
readBlocks first tries to look up the piece locally by requesting the BlockManager to getLocalBytes and, if found, stores the reference in the local block array (for the piece ID).
If not found in the local BlockManager, readBlocks requests the BlockManager to getRemoteBytes.
With checksumEnabled, readBlocks...FIXME
readBlocks requests the BlockManager to store the chunk (so other tasks on this executor don't need to re-fetch it) using MEMORY_AND_DISK_SER storage level and reporting to the driver (so other executors can pull these chunks from this executor as well).
readBlocks creates a ByteBufferBlockData for the chunk (and stores it in the blocks array).
readBlocks throws a SparkException for blocks neither available locally nor remotely:
writeBlocks returns the number of blocks (chunks) this broadcast variable was blockified into.
The whole broadcast value is stored in the local BlockManager with MEMORY_AND_DISK storage level while the block chunks with MEMORY_AND_DISK_SER storage level.
writeBlocks is used when:
TorrentBroadcast is created (that happens on the driver only)
writeBlocks requests the BlockManager to store the given broadcast value (to be identified as the broadcastId and with the MEMORY_AND_DISK storage level).
writeBlocks blockifies the object (into chunks of the block size, using the Serializer and the optional compressionCodec).
With checksumEnabled writeBlocks...FIXME
For every block, writeBlocks creates a BroadcastBlockId for the id and piece[index] identifier, and requests the BlockManager to store the chunk bytes (with MEMORY_AND_DISK_SER storage level and reporting to the driver).
blockifyObject divides (blockifies) the input obj broadcast value into blocks (ByteBuffer chunks). blockifyObject uses the given Serializer to write the value in a serialized format to a ChunkedByteBufferOutputStream of the given blockSize size with the optional CompressionCodec.
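A simplified, stand-alone sketch of the blockify idea follows. It slices an already-serialized byte array into fixed-size ByteBuffer chunks; the real blockifyObject streams through a ChunkedByteBufferOutputStream and may compress the stream, both of which this sketch skips:

```scala
import java.nio.ByteBuffer

// Slice serialized bytes into chunks of at most blockSize bytes each
def blockify(bytes: Array[Byte], blockSize: Int): Array[ByteBuffer] =
  bytes.grouped(blockSize).map(b => ByteBuffer.wrap(b)).toArray

val chunks = blockify("some broadcast value".getBytes("UTF-8"), blockSize = 4)
println(s"blockified into ${chunks.length} chunks")
```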
doUnpersist removes the persisted state (associated with the broadcast variable) on executors only.
doUnpersist\u00a0is part of the Broadcast abstraction.
","text":""},{"location":"broadcast-variables/TorrentBroadcast/#removing-persisted-state-broadcast-blocks-of-broadcast-variable","title":"Removing Persisted State (Broadcast Blocks) of Broadcast Variable
NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is 0)
RetryingBlockFetcher is requested to core:RetryingBlockFetcher.md#fetchAllOutstanding[fetchAllOutstanding]
CleanerListener is an abstraction of listeners that can be core:ContextCleaner.md#attachListener[registered with ContextCleaner] to be informed when <>, <>, <>, <> and <> are cleaned.
ContextCleaner is a Spark service that is responsible for <> (cleanup) of <>, <>, <>, <> and <> that is aimed at reducing the memory requirements of long-running data-heavy Spark applications.
ContextCleaner is created and requested to start when SparkContext is created with configuration-properties.md#spark.cleaner.referenceTracking[spark.cleaner.referenceTracking] configuration property enabled.
registerRDDCheckpointDataForCleanup is used when ContextCleaner is requested to <> (with configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled).
start starts the <> and an action to request the JVM garbage collector (using System.gc()) on a regular basis per configuration-properties.md#spark.cleaner.periodicGC.interval[spark.cleaner.periodicGC.interval] configuration property.
The action to request the JVM GC is scheduled on <>.
start is used when SparkContext is created.
== [[periodicGCService]] periodicGCService Single-Thread Executor Service
periodicGCService is an internal single-thread {java-javadoc-url}/java/util/concurrent/ScheduledExecutorService.html[executor service] with the name context-cleaner-periodic-gc to request the JVM garbage collector.
The periodic runs are started when <> and stopped when <>.
== [[registerShuffleForCleanup]] Registering ShuffleDependency for Cleanup
registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit
registerForCleanup adds the input objectForCleanup to the <> internal queue.
Despite the widest-possible AnyRef type of the input objectForCleanup, the type is really CleanupTaskWeakReference, a custom Java {java-javadoc-url}/java/lang/ref/WeakReference.html[java.lang.ref.WeakReference].
registerForCleanup is used when ContextCleaner is requested to <>, <>, <>, <>, and <>.
doCleanupShuffle(shuffleId: Int, blocking: Boolean): Unit
doCleanupShuffle performs a shuffle cleanup which is to remove the shuffle from the current scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] and storage:BlockManagerMaster.md[BlockManagerMaster]. doCleanupShuffle also notifies core:CleanerListener.md[CleanerListeners].
Internally, when executed, doCleanupShuffle prints out the following DEBUG message to the logs:
doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#mapOutputTracker[MapOutputTracker] to scheduler:MapOutputTracker.md#unregisterShuffle[unregister the given shuffle].
doCleanupShuffle uses core:SparkEnv.md[SparkEnv] to access the core:SparkEnv.md#blockManager[BlockManagerMaster] to storage:BlockManagerMaster.md#removeShuffle[remove the shuffle blocks] (for the given shuffleId).
doCleanupShuffle informs all registered <> that core:CleanerListener.md#shuffleCleaned[shuffle was cleaned].
In the end, doCleanupShuffle prints out the following DEBUG message to the logs:
RetryingBlockFetcher is <> and immediately <> when:
NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0 which it is by default)
RetryingBlockFetcher uses a <> to core:BlockFetchStarter.md#createAndStart[createAndStart] when requested to <> and later <>.
[[outstandingBlocksIds]] RetryingBlockFetcher uses outstandingBlocksIds internal registry of outstanding block IDs to fetch that is initially the <> when <>.
At <>, RetryingBlockFetcher prints out the following INFO message to the logs (with the number of <>):
Retrying fetch ([retryCount]/[maxRetries]) for [size] outstanding blocks after [retryWaitTime] ms
On <> and <>, <> removes the block ID from <>.
[[currentListener]] RetryingBlockFetcher uses a <> to remove block IDs from the <> internal registry.
NettyBlockTransferService is requested to storage:NettyBlockTransferService.md#fetchBlocks[fetchBlocks] (when network:TransportConf.md#io.maxRetries[maxIORetries] is greater than 0 which it is by default)
RetryingBlockFetcher is requested to <>"},{"location":"core/RetryingBlockFetcher/#retryingblockfetchlistener-is-requested-to","title":"* RetryingBlockFetchListener is requested to <>
Enable ALL logging level for org.apache.spark.storage.DiskStore and org.apache.spark.storage.DiskBlockManager loggers to have an even deeper insight on the block storage internals.
"},{"location":"dynamic-allocation/","title":"Dynamic Allocation of Executors","text":"
Dynamic Allocation of Executors (Dynamic Resource Allocation or Elastic Scaling) is a Spark service for adding and removing Spark executors dynamically on demand to match workload.
Unlike the \"traditional\" static allocation where a Spark application reserves CPU and memory resources upfront (irrespective of how much it may eventually use), in dynamic allocation you get as much as needed and no more. It scales the number of executors up and down based on workload, i.e. idle executors are removed, and when there are pending tasks waiting for executors to be launched on, dynamic allocation requests them.
Dynamic Allocation is enabled (and SparkContext creates an ExecutorAllocationManager) when:
spark.dynamicAllocation.enabled configuration property is enabled
spark.master is non-local
SchedulerBackend is an ExecutorAllocationClient
ExecutorAllocationManager is the heart of Dynamic Resource Allocation.
When enabled, it is recommended to use the External Shuffle Service.
Dynamic Allocation comes with the policy of scaling executors up and down as follows:
Scale Up Policy requests new executors when there are pending tasks and increases the number of executors exponentially, since executors start slowly and the Spark application may need slightly more.
Scale Down Policy removes executors that have been idle for spark.dynamicAllocation.executorIdleTimeout seconds.
SparkContext offers a developer API to scale executors up or down.
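For example (a sketch; the calls only take effect with a coarse-grained, non-local scheduler backend, and the executor IDs are hypothetical):

```scala
// Ask the cluster manager for two more executors
val acknowledged = sc.requestExecutors(numAdditionalExecutors = 2)

// Ask the cluster manager to remove specific executors
val killed = sc.killExecutors(Seq("1", "2"))
```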
"},{"location":"dynamic-allocation/#getting-initial-number-of-executors-for-dynamic-allocation","title":"Getting Initial Number of Executors for Dynamic Allocation
getDynamicAllocationInitialExecutors first makes sure that <> is equal or greater than <>.
NOTE: <> falls back to <> if not set. Why print the WARN message to the logs then?
If not, you should see the following WARN message in the logs:
spark.dynamicAllocation.initialExecutors less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
getDynamicAllocationInitialExecutors makes sure that executor:Executor.md#spark.executor.instances[spark.executor.instances] is greater than <>.
NOTE: Both executor:Executor.md#spark.executor.instances[spark.executor.instances] and <> fall back to 0 when not defined explicitly.
If not, you should see the following WARN message in the logs:
spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
getDynamicAllocationInitialExecutors sets the initial number of executors to be the maximum of:
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.initialExecutors
spark.executor.instances
0
You should see the following INFO message in the logs:
Using initial executors = [initialExecutors], max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
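A stand-alone sketch of the computation (the property values below are hypothetical):

```scala
val minExecutors = 0      // spark.dynamicAllocation.minExecutors
val initialExecutors = 3  // spark.dynamicAllocation.initialExecutors
val executorInstances = 0 // spark.executor.instances

val initialNumExecutors = Seq(minExecutors, initialExecutors, executorInstances, 0).max
// initialNumExecutors: Int = 3
```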
getDynamicAllocationInitialExecutors is used when ExecutorAllocationManager is requested to set the initial number of executors.
Requests additional executors from a cluster manager and returns whether the request has been acknowledged by the cluster manager (true) or not (false).
Used when:
SparkContext is requested for additional executors
","text":""},{"location":"dynamic-allocation/ExecutorAllocationClient/#updating-total-executors","title":"Updating Total Executors
Updates a cluster manager with the exact number of executors desired. Returns whether the request has been acknowledged by the cluster manager (true) or not (false).
Used when:
SparkContext is requested to update the number of total executors
ExecutorAllocationManager is requested to start, updateAndSyncNumExecutorsTarget, addExecutors, removeExecutors
ExecutorAllocationListener is a SparkListener.md[] that intercepts events about stages, tasks, and executors, i.e. onStageSubmitted, onStageCompleted, onTaskStart, onTaskEnd, onExecutorAdded, and onExecutorRemoved. Using the events ExecutorAllocationManager can manage the pool of dynamically managed executors.
Internal Class
ExecutorAllocationListener is an internal class of ExecutorAllocationManager with full access to internal registries.
ExecutorAllocationManager uses spark.executor.cores and spark.task.cpus configuration properties for the number of tasks that can be submitted to an executor for full parallelism.
Used when:
maxNumExecutorsNeeded
","text":""},{"location":"dynamic-allocation/ExecutorAllocationManager/#maximum-number-of-executors-needed","title":"Maximum Number of Executors Needed
maxNumExecutorsNeeded(): Int
maxNumExecutorsNeeded requests the ExecutorAllocationListener for the number of pending and running tasks.
maxNumExecutorsNeeded is the smallest integer value that is greater than or equal to the multiplication of the total number of pending and running tasks by executorAllocationRatio divided by tasksPerExecutorForFullParallelism.
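In other words (a sketch with hypothetical numbers; executorAllocationRatio comes from spark.dynamicAllocation.executorAllocationRatio):

```scala
val pendingAndRunningTasks = 10
val executorAllocationRatio = 1.0
val tasksPerExecutorForFullParallelism = 4

val maxNeeded = math.ceil(
  pendingAndRunningTasks * executorAllocationRatio / tasksPerExecutorForFullParallelism).toInt
// maxNeeded: Int = 3
```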
ExecutorMonitor uses a Java ConcurrentHashMap to track available executors.
An executor is added when (via ensureExecutorIsTracked):
onBlockUpdated
onExecutorAdded
onTaskStart
An executor is removed when onExecutorRemoved.
All executors are removed when reset.
executors is used when:
onOtherEvent (cleanupShuffle)
executorCount
executorsKilled
onUnpersistRDD
onTaskEnd
onJobStart
onJobEnd
pendingRemovalCount
timedOutExecutors
","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#fetchfromshufflesvcenabled-flag","title":"fetchFromShuffleSvcEnabled Flag
fetchFromShuffleSvcEnabled: Boolean
ExecutorMonitor initializes fetchFromShuffleSvcEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.shuffle.service.fetch.rdd.enabled configuration properties.
fetchFromShuffleSvcEnabled is enabled (true) when both of the aforementioned configuration properties are enabled.
fetchFromShuffleSvcEnabled is used when:
onBlockUpdated
","text":""},{"location":"dynamic-allocation/ExecutorMonitor/#shuffletrackingenabled-flag","title":"shuffleTrackingEnabled Flag
shuffleTrackingEnabled: Boolean
ExecutorMonitor initializes shuffleTrackingEnabled internal flag based on the values of spark.shuffle.service.enabled and spark.dynamicAllocation.shuffleTracking.enabled configuration properties.
shuffleTrackingEnabled is enabled (true) when the following holds:
spark.shuffle.service.enabled is disabled
spark.dynamicAllocation.shuffleTracking.enabled is enabled
When enabled, shuffleTrackingEnabled is used to skip execution of the following (making them noops):
onJobStart
onJobEnd
When disabled, shuffleTrackingEnabled is used for the following:
Spark applications start one or more Executors for executing tasks.
By default (in Static Allocation of Executors) executors run for the entire lifetime of a Spark application (unlike in Dynamic Allocation).
Executors are managed by ExecutorBackend.
Executors report heartbeats and partial metrics for active tasks to the HeartbeatReceiver RPC Endpoint on the driver.
Executors provide in-memory storage for RDDs that are cached in Spark applications (via BlockManager).
When started, an executor first registers itself with the driver that establishes a communication channel directly to the driver to accept tasks for execution.
Executor offers are described by executor id and the host on which an executor runs.
Executors can run multiple tasks over their lifetime, both in parallel and sequentially, and track running tasks.
Executors use an Executor task launch worker thread pool for launching tasks.
Executors send metrics (and heartbeats) using the Heartbeat Sender Thread.
When created, Executor prints out the following INFO messages to the logs:
Starting executor ID [executorId] on host [executorHostname]
(only for non-local modes) Executor sets SparkUncaughtExceptionHandler as the default handler invoked when a thread abruptly terminates due to an uncaught exception.
(only for non-local modes) Executor requests the BlockManager to initialize (with the Spark application id of the SparkConf).
(only for non-local modes) Executor requests the MetricsSystem to register the following metric sources:
ExecutorSource
JVMCPUSource
ExecutorMetricsSource
ShuffleMetricsSource (of the BlockManager)
Executor uses SparkEnv to access the MetricsSystem and BlockManager.
Executor creates a task class loader (optionally with REPL support) and requests the system Serializer to use it as the default class loader (for deserializing tasks).
Executor starts sending heartbeats with the metrics of active tasks.
updateDependencies fetches missing or outdated extra files (in the given newFiles). For every name-timestamp pair that...FIXME..., updateDependencies prints out the following INFO message to the logs:
Fetching [name] with timestamp [timestamp]
updateDependencies fetches missing or outdated extra jars (in the given newJars). For every name-timestamp pair that...FIXME..., updateDependencies prints out the following INFO message to the logs:
Fetching [name] with timestamp [timestamp]
updateDependencies fetches the file to the SparkFiles root directory.
Executor uses the spark.driver.maxResultSize for TaskRunner when requested to run a task (and decide on a serialized task result).
","text":""},{"location":"executor/Executor/#maximum-size-of-direct-results","title":"Maximum Size of Direct Results
Executor uses the minimum of spark.task.maxDirectResultSize and spark.rpc.message.maxSize when TaskRunner is requested to run a task (and decide on the type of a serialized task result).
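A sketch of the computation (the values shown are assumed defaults, not quotes from the source):

```scala
val taskMaxDirectResultSize = 1L << 20      // spark.task.maxDirectResultSize: assumed 1 MiB
val rpcMessageMaxSize = 128L * 1024 * 1024  // spark.rpc.message.maxSize: assumed 128 MiB

val maxDirectResultSize = math.min(taskMaxDirectResultSize, rpcMessageMaxSize)
// maxDirectResultSize: Long = 1048576
```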
","text":""},{"location":"executor/Executor/#islocal-flag","title":"isLocal Flag
Executor is given the isLocal flag when created to indicate whether the executor (and the Spark application) runs with a local or cluster-specific master URL.
isLocal is disabled (false) by default and is off explicitly when CoarseGrainedExecutorBackend is requested to handle a RegisteredExecutor message.
isLocal is enabled (true) when LocalEndpoint is created
Executor is given user-defined jars when created. No jars are assumed by default.
The jars are specified using spark.executor.extraClassPath configuration property (via --user-class-path command-line option of CoarseGrainedExecutorBackend).
launchTask creates a TaskRunner (with the given ExecutorBackend, the TaskDescription and the PluginContainer) and adds it to the runningTasks internal registry.
launchTask requests the \"Executor task launch worker\" thread pool to execute the TaskRunner (sometime in the future).
In case the decommissioned flag is enabled, launchTask prints out the following ERROR message to the logs:
Launching a task while in decommissioned state.
launchTask is used when:
CoarseGrainedExecutorBackend is requested to handle a LaunchTask message
LocalEndpoint RPC endpoint (of LocalSchedulerBackend) is requested to reviveOffers
","text":""},{"location":"executor/Executor/#sending-heartbeats-and-active-tasks-metrics","title":"Sending Heartbeats and Active Tasks Metrics
Executors keep sending metrics for active tasks to the driver every spark.executor.heartbeatInterval (defaults to 10s with some random initial delay so the heartbeats from different executors do not pile up on the driver).
An executor sends heartbeats using the Heartbeat Sender Thread.
For each task in TaskRunner (in runningTasks internal registry), the task's metrics are computed and become part of the heartbeat (with accumulators).
A blocking Heartbeat message that holds the executor id, all accumulator updates (per task id), and BlockManagerId is sent to HeartbeatReceiver RPC endpoint.
If the response requests to re-register BlockManager, Executor prints out the following INFO message to the logs:
Told to re-register on heartbeat
BlockManager is requested to reregister.
The internal heartbeatFailures counter is reset.
If there are any issues with communicating with the driver, Executor prints out the following WARN message to the logs:
Issue communicating with driver in heartbeater
The internal heartbeatFailures counter is incremented and checked against spark.executor.heartbeat.maxFailures. If the number of failures is greater, the following ERROR message is printed out to the logs:
Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times
The executor exits (using System.exit and exit code 56).
The amount of memory per executor is configured using spark.executor.memory configuration property. It sets the available memory equally for all executors per application.
You can find the value displayed as Memory per Node in the web UI of the standalone Master.
","text":""},{"location":"executor/Executor/#heartbeating-with-partial-metrics-for-active-tasks-to-driver","title":"Heartbeating With Partial Metrics For Active Tasks To Driver
reportHeartBeat(): Unit
reportHeartBeat collects TaskRunners for currently running tasks (active tasks) with their tasks deserialized (i.e. either ready for execution or already started).
TaskRunner has task deserialized when it runs the task.
For every running task, reportHeartBeat takes the TaskMetrics and:
Requests ShuffleRead metrics to be merged
Sets jvmGCTime metrics
reportHeartBeat then records the latest values of internal and external accumulators for every task.
Note
Internal accumulators are a task's metrics while external accumulators are a Spark application's accumulators that a user has created.
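For example, an external (user-created) accumulator, as opposed to a task's internal metric accumulators (a small sketch):

```scala
val errorCount = sc.longAccumulator("errorCount")

sc.parallelize(1 to 100).foreach { n =>
  if (n % 10 == 0) errorCount.add(1)
}

println(errorCount.value)  // 10
```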
reportHeartBeat sends a blocking Heartbeat message to the HeartbeatReceiver (on the driver). reportHeartBeat uses the value of spark.executor.heartbeatInterval configuration property for the RPC timeout.
Note
A Heartbeat message contains the executor identifier, the accumulator updates, and the identifier of the BlockManager.
If the response (from HeartbeatReceiver) is to re-register the BlockManager, reportHeartBeat prints out the following INFO message to the logs and requests the BlockManager to re-register (which will register the blocks the BlockManager manages with the driver).
Told to re-register on heartbeat
HeartbeatResponse requests the BlockManager to re-register when either TaskScheduler or HeartbeatReceiver know nothing about the executor.
When posting the Heartbeat was successful, reportHeartBeat resets heartbeatFailures internal counter.
In case of a non-fatal exception, you should see the following WARN message in the logs (followed by the stack trace).
Issue communicating with driver in heartbeater
On every failure, reportHeartBeat increments the heartbeat failures counter, up to the spark.executor.heartbeat.maxFailures configuration property. When the number of heartbeat failures reaches the maximum, reportHeartBeat prints out the following ERROR message to the logs and the executor terminates with error code 56.
Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_FAILURES] times
reportHeartBeat is used when:
Executor is requested to schedule reporting heartbeat and partial metrics for active tasks to the driver (that happens every spark.executor.heartbeatInterval).
getCurrentMetrics gives metric values for every metric getter.
Given that one metric getter (type) can report multiple metrics, the length of the result collection is the number of metrics (and at least the number of metric getters). The order matters and is exactly that of metricGetters.
| Name | Description |
|------|-------------|
| threadpool.activeTasks | Approximate number of threads that are actively executing tasks (based on ThreadPoolExecutor.getActiveCount) |

== ShuffleReadMetrics
ShuffleReadMetrics is a collection of metrics (accumulators) on reading shuffle data.
val maxResultSize = sc.getConf.get("spark.driver.maxResultSize")
assert(maxResultSize == "1m")

val rddOver1m = sc.range(0, 1024 * 1024 + 10, 1)

scala> rddOver1m.collect
ERROR TaskSetManager: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
ERROR TaskSetManager: Total size of serialized results of 3 tasks (1546.2 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
ERROR TaskSetManager: Total size of serialized results of 4 tasks (2.0 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
WARN TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
ERROR TaskSetManager: Total size of serialized results of 5 tasks (2.5 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 192.168.68.105, executor driver): TaskKilled (Tasks result size has exceeded maxResultSize)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 2 tasks (1030.8 KiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
  ...
","text":""},{"location":"executor/TaskRunner/#thread-name","title":"Thread Name
TaskRunner uses the following thread name (with the taskId of the TaskDescription):
run initializes the threadId internal registry as the current thread identifier (using Thread.getId).
run sets the name of the current thread of execution as the threadName.
run creates a TaskMemoryManager (for the current MemoryManager and taskId). run uses SparkEnv to access the current MemoryManager.
run starts tracking the time to deserialize a task and sets the current thread's context classloader.
run creates a closure Serializer. run uses SparkEnv to access the closure Serializer.
run prints out the following INFO message to the logs (with the taskName and taskId):
Running [taskName] (TID [taskId])
run notifies the ExecutorBackend that the status of the task has changed to RUNNING (for the taskId).
run computes the total amount of time this JVM process has spent in garbage collection.
run uses the addedFiles and addedJars (of the given TaskDescription) to update dependencies.
run takes the serializedTask of the given TaskDescription and requests the closure Serializer to deserialize the task. run sets the task internal reference to hold the deserialized task.
For non-local environments, run prints out the following DEBUG message to the logs before requesting the MapOutputTrackerWorker to update the epoch (using the epoch of the Task to be executed). run uses SparkEnv to access the MapOutputTrackerWorker.
Task [taskId]'s epoch is [epoch]
run requests the metricsPoller...FIXME
run records the current time as the task's start time (taskStartTimeNs).
run requests the Task to run (with taskAttemptId as taskId, attemptNumber from TaskDescription, and metricsSystem as the current MetricsSystem).
Note
run uses SparkEnv to access the MetricsSystem.
Note
The task runs inside a \"monitored\" block (try-finally block) to detect any memory and lock leaks after the task's run finishes regardless of the final outcome - the computed value or an exception thrown.
run creates a Serializer and requests it to serialize the task result (valueBytes).
Note
run uses SparkEnv to access the Serializer.
run updates the metrics of the Task executed.
run updates the metric counters in the ExecutorSource.
run requests the Task executed for accumulator updates and the ExecutorMetricsPoller for metric peaks.
","text":""},{"location":"executor/TaskRunner/#serialized-task-result","title":"Serialized Task Result
run creates a DirectTaskResult (with the serialized task result, the accumulator updates and the metric peaks) and requests the closure Serializer to serialize it.
Note
The serialized DirectTaskResult is a java.nio.ByteBuffer.
run selects between the DirectTaskResult and an IndirectTaskResult based on the size of the serialized task result (limit of this serializedDirectResult byte buffer):
With the size above spark.driver.maxResultSize, run prints out the following WARN message to the logs and serializes an IndirectTaskResult with a TaskResultBlockId.
Finished [taskName] (TID [taskId]). Result is larger than maxResultSize ([resultSize] > [maxResultSize]), dropping it.
With the size above maxDirectResultSize, run creates an TaskResultBlockId and requests the BlockManager to store the task result locally (with MEMORY_AND_DISK_SER). run prints out the following INFO message to the logs and serializes an IndirectTaskResult with a TaskResultBlockId.
Finished [taskName] (TID [taskId]). [resultSize] bytes result sent via BlockManager)
run prints out the following INFO message to the logs and uses the DirectTaskResult created earlier.
Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to driver
Note
serializedResult is either a IndirectTaskResult (possibly with the block stored in BlockManager) or a DirectTaskResult.
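The decision can be summarized with the following sketch (the three size parameters are assumed to be known; this is not the actual code):

```scala
def howToShipResult(
    resultSize: Long,
    maxResultSize: Long,        // spark.driver.maxResultSize
    maxDirectResultSize: Long): String =
  if (maxResultSize > 0 && resultSize > maxResultSize)
    "IndirectTaskResult (result dropped)"
  else if (resultSize > maxDirectResultSize)
    "IndirectTaskResult (result stored in the local BlockManager)"
  else
    "DirectTaskResult (result sent inline to the driver)"
```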
When shuffle:FetchFailedException.md[FetchFailedException] is reported while running a task, run <>.
run shuffle:FetchFailedException.md#toTaskFailedReason[requests FetchFailedException for the TaskFailedReason], serializes it and ExecutorBackend.md#statusUpdate[notifies ExecutorBackend that the task has failed] (with <>, TaskState.FAILED, and a serialized reason).
NOTE: ExecutorBackend was specified when <>.
NOTE: run uses a closure serializer:Serializer.md[Serializer] to serialize the failure reason. The Serializer was created before run ran the task.
run then <> and ExecutorBackend.md#statusUpdate[notifies ExecutorBackend that the task has been killed] (with <>, TaskState.KILLED, and a serialized TaskKilled object).

== InterruptedException (with Task Killed)
When InterruptedException is reported while running a task, and the task has been killed, you should see the following INFO message in the logs:
Executor interrupted and killed [taskName] (TID [taskId]), reason: [killReason]
run then <> and ExecutorBackend.md#statusUpdate[notifies ExecutorBackend that the task has been killed] (with <>, TaskState.KILLED, and a serialized TaskKilled object).
NOTE: The difference between this InterruptedException and <> is the INFO message in the logs.

== CommitDeniedException
When CommitDeniedException is reported while running a task, run <> and ExecutorBackend.md#statusUpdate[notifies ExecutorBackend that the task has failed] (with <>, TaskState.FAILED, and a serialized TaskKilled object).
NOTE: The difference between this CommitDeniedException and <> is just the reason being sent to ExecutorBackend.

== Throwable
When run catches a Throwable, you should see the following ERROR message in the logs (followed by the exception).
Exception in [taskName] (TID [taskId])
run then records the following task metrics (only when <> is available):
run then scheduler:Task.md#collectAccumulatorUpdates[collects the latest values of internal and external accumulators] (with taskFailed flag enabled to inform that the collection is for a failed task).
Otherwise, when <> is not available, the accumulator collection is empty.
run converts the task accumulators to collection of AccumulableInfo, creates a ExceptionFailure (with the accumulators), and serializer:Serializer.md#serialize[serializes them].
NOTE: run uses a closure serializer:Serializer.md[Serializer] to serialize the ExceptionFailure.
CAUTION: FIXME Why does run create new ExceptionFailure(t, accUpdates).withAccums(accums), i.e. accumulators occur twice in the object.
run <> and ExecutorBackend.md#statusUpdate[notifies ExecutorBackend that the task has failed] (with <>, TaskState.FAILED, and the serialized ExceptionFailure).
run may also trigger SparkUncaughtExceptionHandler.uncaughtException(t) if this is a fatal error.
NOTE: The difference between this Throwable case and the other FAILED cases (i.e. <> and <>) is just the serialized ExceptionFailure vs a reason being sent to ExecutorBackend, respectively.

== collectAccumulatorsAndResetStatusOnFailure
kill marks the TaskRunner as <> and scheduler:Task.md#kill[kills the task] (if available and not <> already).
NOTE: kill passes the input interruptThread on to the task itself while killing it.
When executed, you should see the following INFO message in the logs:
Executor is trying to kill [taskName] (TID [taskId]), reason: [reason]
NOTE: <> flag is checked periodically in <> to stop executing the task. Once killed, the task will eventually stop.

== Logging
Enable ALL logging level for org.apache.spark.executor.Executor logger to see what happens inside.
","text":""},{"location":"executor/TaskRunner/#internal-properties","title":"Internal Properties","text":""},{"location":"executor/TaskRunner/#finished-flag","title":"finished Flag
finished flag says whether the <> has finished (true) or not (false)
Default: false
Enabled (true) after TaskRunner has been requested to <>
Used when TaskRunner is requested to <>","text":""},{"location":"executor/TaskRunner/#reasonifkilled","title":"reasonIfKilled
Timestamp (which is really the Executor.md#computeTotalGcTime[total amount of time this Executor JVM process has already spent in garbage collection]) that is used to mark the GC "zero" time (when <>) and then compute the JVM GC time metric when:
TaskRunner is requested to <> and <>
Executor is requested to Executor.md#reportHeartBeat[reportHeartBeat]
","text":""},{"location":"executor/TaskRunner/#task-name","title":"Task Name
The name of the task (of the TaskDescription) that is used exclusively for <> purposes when TaskRunner is requested to <> and <> the task

=== Thread Id
Current thread ID
Default: -1
Set immediately when TaskRunner is requested to <> and used exclusively when TaskReaper is requested for the thread info of the current thread (aka thread dump)
== WordCount using Spark shell
Like any introductory big data example, it demonstrates how to count words in a distributed fashion.
In the following example you're going to count the words in README.md file that sits in your Spark distribution and save the result under README.count directory.
You're going to use spark-shell.md[the Spark shell] for the example. Execute spark-shell.
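A sketch of the four steps the callouts below refer to (the whitespace-based splitting is an assumption):

```scala
val lines = sc.textFile("README.md")                  // <1>
val words = lines.flatMap(_.split("\\s+"))            // <2>
val wc = words.map(w => (w, 1)).reduceByKey(_ + _)    // <3>
wc.saveAsTextFile("README.count")                     // <4>
```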
<1> Read the text file - refer to spark-io.md[Using Input and Output (I/O)]. <2> Split each line into words and flatten the result. <3> Map each word into a pair and count them by word (key). <4> Save the result into text files - one per partition.
After you have executed the example, see the contents of the README.count directory:
$ ls -lt README.count
total 16
-rw-r--r-- 1 jacek staff 0 9 paź 13:36 _SUCCESS
-rw-r--r-- 1 jacek staff 1963 9 paź 13:36 part-00000
-rw-r--r-- 1 jacek staff 1663 9 paź 13:36 part-00001
The files part-0000x contain the pairs of word and the count.
Please read the questions and give answers first before looking at the link given.
Why are there two files under the directory?
How could you have only one?
How to filter out words by name?
How to count words?
Please refer to the chapter spark-rdd-partitions.md[Partitions] to find some of the answers.
"},{"location":"exercises/spark-exercise-custom-scheduler-listener/","title":"Developing Custom SparkListener to monitor DAGScheduler in Scala","text":"
== Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala
The example shows how to develop a custom Spark Listener. You should read SparkListener.md[] first to understand the motivation for the example.
=== Requirements
https://www.jetbrains.com/idea/[IntelliJ IDEA] (or eventually http://www.scala-sbt.org/[sbt] alone if you're adventurous).
Access to Internet to download Apache Spark's dependencies.
=== Setting up Scala project using IntelliJ IDEA
Create a new project custom-spark-listener.
Add the following line to build.sbt (the main configuration file for the sbt project) that adds the dependency on Apache Spark.
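A sketch of such a dependency line (the Spark version and the provided scope are assumptions; use the version you run against):

```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"
```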
Create a Scala class -- CustomSparkListener -- for your custom SparkListener. It should be under src/main/scala directory (create one if it does not exist).
The aim of the class is to intercept scheduler events about jobs being started and tasks completed.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

class CustomSparkListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"Job started with ${jobStart.stageInfos.size} stages: $jobStart")
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println(s"Stage ${stageCompleted.stageInfo.stageId} completed with ${stageCompleted.stageInfo.numTasks} tasks.")
  }
}
=== Creating deployable package
Package the custom Spark listener. Execute sbt package command in the custom-spark-listener project's main directory.
$ sbt package
[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins
[info] Loading project definition from /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project
[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/project/}custom-spark-listener-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to custom-spark-listener (in build file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/)
[info] Updating {file:/Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/}custom-spark-listener...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/classes...
[info] Packaging /Users/jacek/dev/workshops/spark-workshop/solutions/custom-spark-listener/target/scala-2.11/custom-spark-listener_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 8 s, completed Oct 27, 2016 11:23:50 AM
You should find the result jar file with the custom scheduler listener ready under target/scala-2.11 directory, e.g. target/scala-2.11/custom-spark-listener_2.11-1.0.jar.
=== Activating Custom Listener in Spark shell
Start ../spark-shell.md[spark-shell] with additional configurations for the extra custom listener and the jar that includes the class.
What are the pros and cons of using the command line version vs inside a Spark application?
"},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/","title":"Working with Datasets from JDBC Data Sources (and PostgreSQL)","text":"
== Working with Datasets from JDBC Data Sources (and PostgreSQL)
Start spark-shell with the JDBC driver for the database you want to use. In our case, it is PostgreSQL JDBC Driver.
NOTE: Download the jar for PostgreSQL JDBC Driver 42.1.1 directly from the http://central.maven.org/maven2/org/postgresql/postgresql/42.1.1/postgresql-42.1.1.jar[Maven repository].
// Note the number of partitions (aka numPartitions)
scala> df.explain
== Physical Plan ==
*Scan JDBCRelation(projects) [numPartitions=1] [id#0,name#1,website#2] ReadSchema: struct

// use jdbc method with predicates to define partitions
import java.util.Properties
val df4parts = spark.
  read.
  jdbc(
    url = "jdbc:postgresql:sparkdb",
    table = "projects",
    predicates = Array("id=1", "id=2", "id=3", "id=4"),
    connectionProperties = new Properties())

// Note the number of partitions (aka numPartitions)
scala> df4parts.explain
== Physical Plan ==
*Scan JDBCRelation(projects) [numPartitions=4] [id#16,name#17,website#18] ReadSchema: struct
If things can go wrong, they sooner or later will. Here is a list of possible issues and their solutions.
==== java.sql.SQLException: No suitable driver
Ensure that the JDBC driver sits on the CLASSPATH. Use spark-submit/index.md#driver-class-path[--driver-class-path] as described above (--packages or --jars do not work).
scala> val df = spark.
     | read.
     | format("jdbc").
     | options(opts).
     | load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:301)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
  ... 52 elided
=== PostgreSQL Setup
NOTE: I'm on Mac OS X so YMMV (aka Your Mileage May Vary).
Use the sections to have a properly configured PostgreSQL database.
<>
<>
<>
<>
<>
<>
<>
==== [[installation]] Installation
Install PostgreSQL as described in...TK
CAUTION: This page serves as a cheatsheet for the author so he does not have to search Internet to find the installation steps.
$ initdb /usr/local/var/postgres -E utf8
The files belonging to this database system will be owned by user "jacek".
This user must also own the server process.

The database cluster will be initialized with locale "pl_pl.utf-8".
initdb: could not find suitable text search configuration for locale "pl_pl.utf-8"
The default text search configuration will be set to "simple".

Data page checksums are disabled.

creating directory /usr/local/var/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /usr/local/var/postgres/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /usr/local/var/postgres -l logfile start
==== [[starting-database-server]] Starting Database Server
NOTE: Consult http://www.postgresql.org/docs/current/static/server-start.html[17.3. Starting the Database Server] in the official documentation.
Enable all logs in PostgreSQL to see query statements.
log_statement = 'all'
"},{"location":"exercises/spark-exercise-dataframe-jdbc-postgresql/#add-log_statement-all-to-usrlocalvarpostgrespostgresqlconf-on-mac-os-x-with-postgresql-installed-using-brew","title":"Add log_statement = 'all' to /usr/local/var/postgres/postgresql.conf on Mac OS X with PostgreSQL installed using brew.","text":"
Alternatively, you can run the database server using postgres.
$ postgres -D /usr/local/var/postgres
==== [[creating-database]] Create Database
$ createdb sparkdb
TIP: Consult http://www.postgresql.org/docs/current/static/app-createdb.html[createdb] in the official documentation.
==== Accessing Database
Use psql sparkdb to access the database.
$ psql sparkdb
psql (9.6.2)
Type "help" for help.

sparkdb=#
Execute SELECT version() to know the version of the database server you have connected to.
sparkdb=# SELECT version();
 version
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.2 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
(1 row)
Use \\h for help and \\q to leave a session.
==== Creating Table
Create a table using CREATE TABLE command.
CREATE TABLE projects (
  id SERIAL PRIMARY KEY,
  name text,
  website text
);
Insert rows to initialize the table with data.
INSERT INTO projects (name, website) VALUES ('Apache Spark', 'http://spark.apache.org');
INSERT INTO projects (name, website) VALUES ('Apache Hive', 'http://hive.apache.org');
INSERT INTO projects VALUES (DEFAULT, 'Apache Kafka', 'http://kafka.apache.org');
INSERT INTO projects VALUES (DEFAULT, 'Apache Flink', 'http://flink.apache.org');
Execute select * from projects; to ensure that you have the following records in projects table:
TIP: Consult http://www.postgresql.org/docs/current/static/app-dropdb.html[dropdb] in the official documentation.
==== Stopping Database Server
pg_ctl -D /usr/local/var/postgres stop
"},{"location":"exercises/spark-exercise-failing-stage/","title":"Causing Stage to Fail","text":"
== Exercise: Causing Stage to Fail
The example shows how Spark re-executes a stage in case of stage failure.
=== Recipe
Start a Spark cluster, e.g. 1-node Hadoop YARN.
start-yarn.sh
// 2-stage job -- it _appears_ that a stage can be failed only when there is a shuffle
sc.parallelize(0 to 3e3.toInt, 2).map(n => (n % 2, n)).groupByKey.count
Use 2 executors at least so you can kill one and keep the application up and running (on one executor).
NOTE: It is not possible to start another instance of standalone Master on the same machine using ./sbin/start-master.sh. The reason is that the script assumes one instance per machine only. We're going to change the script to make it possible.
Wait till the Spark shell connects to an active standalone Master.
Find out which standalone Master is active (there can only be one). Kill it. Observe how the other standalone Master takes over and lets the Spark shell register with itself. Check out the master's UI.
Optionally, kill the worker, make sure it goes away instantly in the active master's logs.
"},{"location":"exercises/spark-exercise-take-multiple-jobs/","title":"Learning Jobs and Partitions Using take Action","text":"
== Exercise: Learning Jobs and Partitions Using take Action
The exercise aims for introducing take action and using spark-shell and web UI. It should introduce you to the concepts of partitions and jobs.
The following snippet creates an RDD of 16 elements with 16 partitions.
scala> val r1 = sc.parallelize(0 to 15, 16)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:18

scala> r1.partitions.size
res63: Int = 16

scala> r1.foreachPartition(it => println(">>> partition size: " + it.size))
...
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
... // the machine has 8 cores
... // so first 8 tasks get executed immediately
... // with the others after a core is free to take on new tasks.
>>> partition size: 1
...
>>> partition size: 1
...
>>> partition size: 1
...
>>> partition size: 1
>>> partition size: 1
...
>>> partition size: 1
>>> partition size: 1
>>> partition size: 1
All 16 partitions have one element.
When you execute r1.take(1) only one job gets run since it is enough to compute one task on one partition.
CAUTION: FIXME Snapshot from web UI - note the number of tasks
However, when you execute r1.take(2), two jobs get run, because the implementation first assumes one job with one partition and, if the elements didn't add up to the number of elements requested in take, quadruples the partitions to work on in the following jobs.
CAUTION: FIXME Snapshot from web UI - note the number of tasks
Can you guess how many jobs are run for r1.take(15)? How many tasks per job?
CAUTION: FIXME Snapshot from web UI - note the number of tasks
Answer: 3.
"},{"location":"exercises/spark-first-app/","title":"Your first complete Spark application (using Scala and sbt)","text":"
== Your first Spark application (using Scala and sbt)
This page gives you the exact steps to develop and run a complete Spark application using http://www.scala-lang.org/[Scala] programming language and http://www.scala-sbt.org/[sbt] as the build tool.
[TIP] Refer to Quick Start's http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/quick-start.html#self-contained-applications[Self-Contained Applications] in the official documentation.
The sample application called SparkMe App is...FIXME
=== Overview
You're going to use http://www.scala-sbt.org/[sbt] as the project build tool. It uses build.sbt for the project's description as well as the dependencies, i.e. the version of Apache Spark and others.
The application's main code is under src/main/scala directory, in SparkMeApp.scala file.
With the files in a directory, executing sbt package results in a package that can be deployed onto a Spark cluster using spark-submit.
In this example, you're going to use Spark's local/spark-local.md[local mode].
=== Project's build - build.sbt
Any Scala project managed by sbt uses build.sbt as the central place for configuration, including project dependencies denoted as libraryDependencies.
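A sketch of what build.sbt could look like for this project (the names and the Scala version are assumptions; the Spark version corresponds to the callout below):

```scala
name := "SparkMe Project"
version := "1.0"
organization := "pl.japila"
scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0-SNAPSHOT"  // <1>

resolvers += Resolver.mavenLocal
```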
<1> Use the development version of Spark 1.6.0-SNAPSHOT
=== SparkMe Application
The application uses a single command-line parameter (as args(0)) that is the file to process. The file is read and the number of lines printed out.
package pl.japila.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkMe Application")
    val sc = new SparkContext(conf)

    val fileName = args(0)
    val lines = sc.textFile(fileName).cache

    val c = lines.count
    println(s"There are $c lines in $fileName")
  }
}
=== sbt version - project/build.properties
The sbt launcher uses the project/build.properties file to set up the (real) sbt:
sbt.version=0.13.9
TIP: With the file the build is more predictable as the version of sbt doesn't depend on the sbt launcher.
=== Packaging Application
Execute sbt package to package the application.
➜  sparkme-app sbt package
[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins
[info] Loading project definition from /Users/jacek/dev/sandbox/sparkme-app/project
[info] Set current project to SparkMe Project (in build file:/Users/jacek/dev/sandbox/sparkme-app/)
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/classes...
[info] Packaging /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/sparkme-project_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 3 s, completed Sep 23, 2015 12:47:52 AM
The application uses only classes that come with Spark, so package is enough.
In target/scala-2.11/sparkme-project_2.11-1.0.jar there is the final application ready for deployment.
=== Submitting Application to Spark (local)
NOTE: The application is going to be deployed to local[*]. Change it to whatever cluster you have available (refer to spark-cluster.md[Running Spark in cluster]).
spark-submit the SparkMe application and specify the file to process (as it is the only and required input parameter to the application), e.g. build.sbt of the project.
NOTE: build.sbt is sbt's build definition and is only used as an input file for demonstration purposes. Any file is going to work fine.
➜  sparkme-app ~/dev/oss/spark/bin/spark-submit --master "local[*]" --class pl.japila.spark.SparkMeApp target/scala-2.11/sparkme-project_2.11-1.0.jar build.sbt
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
15/09/23 01:06:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/23 01:06:04 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
There are 8 lines in build.sbt
NOTE: Disregard the two above WARN log messages.
You're done. Sincere congratulations!
"},{"location":"exercises/spark-hello-world-using-spark-shell/","title":"Spark's Hello World using Spark shell and Scala","text":"
== Exercise: Spark's Hello World using Spark shell and Scala
Run Spark shell and count the number of words in a file using the MapReduce pattern (a sketch follows the steps below):
Use sc.textFile to read the file into memory
Use RDD.flatMap for a mapper step
Use reduceByKey for a reducer step
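Putting the three steps together (a sketch; README.md and the whitespace-based splitting are assumptions):

```scala
val lines = sc.textFile("README.md")
val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

counts.take(10).foreach(println)
```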
"},{"location":"exercises/spark-sql-hive-orc-example/","title":"Using Spark SQL to update data in Hive using ORC files","text":"
== Using Spark SQL to update data in Hive using ORC files
The example has showed up on Spark's users mailing list.
FIXME Load ORC files into a DataFrame: val df = hiveContext.read.format("orc").load(to/path)
Solution was to use Hive in ORC format with partitions:
A table in Hive stored as an ORC file (using partitioning)
Using SQLContext.sql to insert data into the table
Using SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small files into larger files optimized for your HDFS block size
** Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing
The Hive solution is just to concatenate the files:
** It does not alter or change records.
** It's possible to update data in Hive using the ORC format.
** With transactional tables in Hive, together with insert, update and delete, it does the "concatenate" for you automatically at regular intervals. Currently this works only with tables in ORC format (stored as orc).
** Alternatively, use HBase with Phoenix as the SQL layer on top.
** Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent versions can do updates, deletes etc. in a transactional way.
Criteria:
spark-streaming/spark-streaming.md[Spark Streaming] jobs are receiving a lot of small events (avg 10kb)
Events are stored to HDFS, e.g. for Pig jobs
There are a lot of small files in HDFS (several millions)
External Shuffle Service is a Spark service that serves RDD and shuffle blocks outside of, and for, Executors.
ExternalShuffleService can be started as a command-line application or automatically as part of a worker node in a Spark cluster (e.g. Spark Standalone).
External Shuffle Service is enabled in a Spark application using spark.shuffle.service.enabled configuration property.
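For example, a SparkConf-based sketch of enabling it for an application (pairing it with dynamic allocation is a common, but not required, choice):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")  // commonly used together
```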
ExternalShuffleBlockResolver uses spark.shuffle.service.fetch.rdd.enabled configuration property to control whether or not to remove cached RDD files (alongside shuffle output files).
ExternalBlockHandler is requested to handle a RegisterExecutor message and reregisterExecutor
","text":""},{"location":"external-shuffle-service/ExternalShuffleBlockResolver/#cleaning-up-local-directories-for-removed-executor","title":"Cleaning Up Local Directories for Removed Executor
executorRemoved prints out the following INFO message to the logs:
Clean up non-shuffle and non-RDD files associated with the finished executor [executorId]
executorRemoved looks up the executor in the executors internal registry.
When found, executorRemoved prints out the following INFO message to the logs and requests the Directory Cleaner Executor to execute asynchronous deletion of the executor's local directories (on a separate thread).
Cleaning up non-shuffle and non-RDD files in executor [AppExecId]'s [localDirs] local dirs
When not found, executorRemoved prints out the following INFO message to the logs:
Executor is not registered (appId=[appId], execId=[executorId])
executorRemoved is used when:
ExternalBlockHandler is requested to executorRemoved
ExternalShuffleService is a Spark service that can serve RDD and shuffle blocks.
ExternalShuffleService manages shuffle output files so they are available to executors. As the shuffle output files are managed externally to the executors it offers an uninterrupted access to the shuffle output files regardless of executors being killed or down (esp. with Dynamic Allocation of Executors).
ExternalShuffleService can be launched from command line.
ExternalShuffleService is enabled on the driver and executors using spark.shuffle.service.enabled configuration property.
Note
Spark on YARN uses a custom external shuffle service (YarnShuffleService).

=== main Entry Point
```scala
main(
  args: Array[String]): Unit
```
main is the entry point of ExternalShuffleService standalone application.
main prints out the following INFO message to the logs:
Started daemon with process name: [name]
main registers signal handlers for TERM, HUP, INT signals.
main loads the default Spark properties.
main creates a SecurityManager.
main turns spark.shuffle.service.enabled to true explicitly (since this service is started from the command line for a reason).
main creates an ExternalShuffleService and starts it.
main prints out the following DEBUG message to the logs:
Adding shutdown hook
main registers a shutdown hook. When triggered, the shutdown hook prints the following INFO message to the logs and requests the ExternalShuffleService to stop.
ExternalShuffleService uses spark.shuffle.service.enabled configuration property to control whether or not is enabled (and should be started when requested).
ExternalShuffleService creates an ExternalBlockHandler when created.
With spark.shuffle.service.db.enabled and spark.shuffle.service.enabled configuration properties enabled, the ExternalBlockHandler is given a local directory with a registeredExecutors.ldb file.
blockHandler is used to create a TransportContext that creates the TransportServer.
findRegisteredExecutorsDBFile searches the local directories (defined using spark.local.dir configuration property) for the input dbName file. If the file cannot be found, findRegisteredExecutorsDBFile falls back to the dbName file in the first local directory. findRegisteredExecutorsDBFile returns null when no local directories are defined.
With no local directories defined in spark.local.dir configuration property, findRegisteredExecutorsDBFile prints out the following WARN message to the logs and returns null.
'spark.local.dir' should be set first when we use db in ExternalShuffleService. Note that this only affects standalone mode.
Enables ExternalShuffleService for fetching disk persisted RDD blocks.
When enabled with Dynamic Resource Allocation executors having only disk persisted blocks are considered idle after spark.dynamicAllocation.executorIdleTimeout and will be released accordingly.
Default: false
Used when:
ExternalShuffleBlockResolver is created
SparkEnv utility is requested to create a \"base\" SparkEnv
StorageUtils utility is requested for the port of an external shuffle service

== Spark History Server
Spark History Server is the web UI of Spark applications with event log collection enabled (based on spark.eventLog.enabled configuration property).
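As an illustration, event log collection could be enabled in a Spark application with a configuration like the sketch below (the log directory is an assumption; it has to exist and spark.history.fs.logDirectory of the History Server should point at it):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Write JSON-encoded events so a History Server can replay this application later.
val conf = new SparkConf()
  .setAppName("history-server-demo")                       // hypothetical app name
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")   // assumed directory
val sc = new SparkContext(conf)
```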
Spark History Server is an extension of Spark's web UI.
Spark History Server can be started using start-history-server.sh and stopped using stop-history-server.sh shell scripts.
Spark History Server supports custom configuration properties that can be defined using --properties-file [propertiesFile] command-line option. The properties file can have any valid spark.-prefixed Spark property.
$SPARK_HOME/sbin/start-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to start a Spark History Server instance.
```text
$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to .../spark/logs/spark-jacek-org.apache.spark.deploy.history.HistoryServer-1-japila.out
```
Using the more explicit approach with spark-class to start Spark History Server can make it easier to trace the execution, since the logs are printed out to the standard output (and hence to the terminal) directly.
When started, start-history-server.sh prints out the following INFO message to the logs:
Started daemon with process name: [processName]
start-history-server.sh registers signal handlers (using SignalUtils) for TERM, HUP, INT to log their execution:
RECEIVED SIGNAL [signal]
start-history-server.sh inits security if enabled (based on spark.history.kerberos.enabled configuration property).
start-history-server.sh creates a SecurityManager.
start-history-server.sh creates an ApplicationHistoryProvider (based on spark.history.provider configuration property).
In the end, start-history-server.sh creates a HistoryServer and requests it to bind to the port (based on spark.history.ui.port configuration property).
Note
The host's IP can be specified using SPARK_LOCAL_IP environment variable (defaults to 0.0.0.0).
start-history-server.sh prints out the following INFO message to the logs:
Bound HistoryServer to [host], and started at [webUrl]
start-history-server.sh registers a shutdown hook to call stop on the HistoryServer instance.
$SPARK_HOME/sbin/stop-history-server.sh shell script (where SPARK_HOME is the directory of your Spark installation) is used to stop a running instance of Spark History Server.
ApplicationCache is <> exclusively when HistoryServer is HistoryServer.md#appCache[created].
ApplicationCache uses https://github.com/google/guava/wiki/Release14[Google Guava 14.0.1] library for the internal <>.
[[internal-registries]] .ApplicationCache's Internal Properties (e.g. Registries, Counters and Flags)
[cols="1,2",options="header",width="100%"]
|===
| Name | Description
| appLoader | [[appLoader]] Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html[CacheLoader] with a custom ++https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html#load(K)++[load] which is simply <>.
Used when...FIXME
| removalListener | [[removalListener]]
| appCache a| [[appCache]] Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/LoadingCache.html[LoadingCache] of CacheKey keys and CacheEntry entries
Used when ApplicationCache is requested for the following:
NOTE: load is part of Google Guava's https://google.github.io/guava/releases/14.0/api/docs/com/google/common/cache/CacheLoader.html[CacheLoader] to retrieve a CacheEntry, based on a CacheKey, for <>.
load simply relays to <> with the appId and attemptId of the input CacheKey.
=== [[get]] Requesting Cached UI of Spark Application (CacheEntry) -- get Method
| getAppUI | [[getAppUI]] spark-webui-SparkUI.md[SparkUI] (the UI of a Spark application)
Used exclusively when ApplicationCache is requested for ApplicationCache.md#loadApplicationEntry[loadApplicationEntry]
| attachSparkUI | [[attachSparkUI]]
| detachSparkUI | [[detachSparkUI]] |===
[[implementations]] NOTE: HistoryServer.md[HistoryServer] is the one and only known implementation of <> in Apache Spark.

== ApplicationHistoryProvider
ApplicationHistoryProvider is an abstraction of history providers.
EventLoggingListener is a SparkListener that writes out JSON-encoded events of a Spark application with event logging enabled (based on spark.eventLog.enabled configuration property).
FsHistoryProvider takes the following to be created:
SparkConf
Clock (default: SystemClock)
FsHistoryProvider is created when HistoryServer standalone application is started (and no spark.history.provider configuration property was defined).

=== Path of Application History Cache
```scala
storePath: Option[File]
```
FsHistoryProvider uses spark.history.store.path configuration property for the directory to cache application history.
With storePath defined, FsHistoryProvider uses LevelDB as the KVStore. Otherwise, it uses an InMemoryStore.
With storePath defined, FsHistoryProvider uses a HistoryServerDiskManager as the disk manager.
HistoryServer is an extension of the web UI for reviewing event logs of running (active) and completed Spark applications with event log collection enabled (based on spark.eventLog.enabled configuration property).
Setting the log directory through the command line is however deprecated since Spark 1.1.0 and you should see the following WARN message in the logs:
WARN HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead.
The same WARN message shows up for --dir and -d command-line options.
--properties-file [propertiesFile] command-line option specifies the file with the custom spark-properties.md[Spark properties].
NOTE: When not specified explicitly, History Server uses the default configuration file, i.e. spark-properties.md#spark-defaults-conf[spark-defaults.conf].

Refer to spark-logging.md[Logging].

== HistoryServerDiskManager
HistoryServerDiskManager is a disk manager for FsHistoryProvider.
replay reads JSON-encoded SparkListener.md#SparkListenerEvent[SparkListenerEvent] events from logData (one event per line) and posts them to all registered SparkListenerInterfaces.
replay uses spark-history-server:JsonProtocol.md#sparkEventFromJson[JsonProtocol to convert JSON-encoded events to SparkListenerEvent objects].
NOTE: replay uses jackson from http://json4s.org/[json4s] library to parse the AST for JSON.
When there is an exception parsing a JSON event, you may see the following WARN message in the logs (for the last line) or a JsonParseException.
WARN Got JsonParseException from log file $sourceName at line [lineNumber], the file might not have finished writing cleanly.
Any other non-IO exceptions end up with the following ERROR messages in the logs:
SQLHistoryListener is a custom spark-sql-SQLListener.md[SQLListener] for index.md[History Server]. It attaches spark-sql-webui.md#creating-instance[SQL tab] to History Server's web UI only when the first spark-sql-SQLListener.md#SparkListenerSQLExecutionStart[SparkListenerSQLExecutionStart] arrives and shuts <> off. It also handles <>.
NOTE: Support for SQL UI in History Server was added in SPARK-11206 Support SQL UI on the history server.
When SparkListenerSQLExecutionStart event comes, onOtherEvent attaches spark-sql-webui.md#creating-instance[SQL tab] to web UI and passes the call to the parent spark-sql-SQLListener.md[SQLListener].
SQLHistoryListener is created using a (private[sql]) SQLHistoryListenerFactory class (which is SparkHistoryListenerFactory).
The SQLHistoryListenerFactory class is registered when spark-webui-SparkUI.md#createHistoryUI[SparkUI creates a web UI for History Server] as a Java service in META-INF/services/org.apache.spark.scheduler.SparkHistoryListenerFactory:
NOTE: Loading the service uses Java's https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html#load-java.lang.Class-java.lang.ClassLoader-[ServiceLoader.load] method.
The codec used to compress event log (with spark.eventLog.compress enabled). By default, Spark provides four codecs: lz4, lzf, snappy, and zstd. You can also use fully qualified class names to specify the codec.
Spark local is one of the available runtime environments in Apache Spark. It is the only available runtime with no need for a proper cluster manager (and hence many call it a pseudo-cluster, however such concept do exist in Spark and is a bit different).
Spark local is used for the following master URLs (as specified using <<../SparkConf.md#, SparkConf.setMaster>> method or <<../configuration-properties.md#spark.master, spark.master>> configuration property):
local (with exactly 1 CPU core)
local[n] (with exactly n CPU cores)
local[*] (with the total number of CPU cores that is the number of available CPU cores on the local machine)
local[n, m] (with exactly n CPU cores and m retries when a task fails)
local[*, m] (with the total number of CPU cores that is the number of available CPU cores on the local machine)
Internally, Spark local uses <> as the <<../SchedulerBackend.md#, SchedulerBackend>> and executor:ExecutorBackend.md[].
.Architecture of Spark local
image::../diagrams/spark-local-architecture.png[align="center"]
In this non-distributed multi-threaded runtime environment, Spark spawns all the main execution components - the spark-driver.md[driver] and an executor:Executor.md[] - in the same single JVM.
The default parallelism is the number of threads as specified in the <>. This is the only mode where a driver is used for execution (as it acts both as the driver and the only executor).
The local mode is very convenient for testing, debugging or demonstration purposes as it requires no earlier setup to launch Spark applications.
This mode of operation is also called http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark[Spark in-process] or (less commonly) a local version of Spark.
SparkContext.isLocal returns true when Spark runs in local mode.
```text
scala> sc.isLocal
res0: Boolean = true
```
Spark shell defaults to local mode with local[*] as the master URL.
```text
scala> sc.master
res0: String = local[*]
```
Tasks are not re-executed on failure in local mode (unless <> is used).
The scheduler:TaskScheduler.md[task scheduler] in local mode works with local/spark-LocalSchedulerBackend.md[LocalSchedulerBackend] task scheduler backend.
You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.
The URL says how many threads can be used in total:
local uses 1 thread only.
local[n] uses n threads.
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#availableProcessors--[Runtime.getRuntime.availableProcessors()] to know the number).
NOTE: What happens when there are fewer cores than n in the local[n] master URL? It "breaks" scheduling as Spark assumes more CPU cores are available to execute tasks.
[[local-with-retries]] local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of <<../configuration-properties.md#spark.task.maxFailures, spark.task.maxFailures>> configuration property.
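A sketch of the master URL variants in code (the thread and retry counts are arbitrary choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only one master URL can be active per SparkContext; the commented lines
// show the alternative forms described above.
val conf = new SparkConf()
  .setAppName("local-mode-demo")   // hypothetical app name
  .setMaster("local[2]")           // 2 threads
  // .setMaster("local")           // 1 thread
  // .setMaster("local[*]")        // as many threads as available processors
  // .setMaster("local[4, 3]")     // 4 threads and 3 task retries (local-with-retries)
val sc = new SparkContext(conf)
```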
.TaskSchedulerImpl.submitTasks in local mode
image::taskscheduler-submitTasks-local-mode.png[align="center"]
When ReviveOffers or StatusUpdate messages are received, local/spark-LocalEndpoint.md[LocalEndpoint] places an offer to TaskSchedulerImpl (using TaskSchedulerImpl.resourceOffers).
If there is one or more tasks that match the offer, they are launched (using executor.launchTask method).
The number of tasks to be launched is controlled by the number of threads as specified in <>. The executor uses threads to spawn the tasks.

== LauncherBackend
Used exclusively when LauncherBackend is requested to <> (to access configuration-properties.md#spark.launcher.port[spark.launcher.port] and configuration-properties.md#spark.launcher.secret[spark.launcher.secret] configuration properties)
Spark Standalone's StandaloneSchedulerBackend is requested to <> (in client deploy mode)
Spark local's LocalSchedulerBackend is <>
Spark on Mesos' MesosCoarseGrainedSchedulerBackend is requested to <> (in client deploy mode)
Spark on YARN's Client is requested to <>
LocalEndpoint is <> exclusively when LocalSchedulerBackend is requested to <>.
Put simply, LocalEndpoint is the communication channel between <> and <>. LocalEndpoint is a (thread-safe) rpc:RpcEndpoint.md[RpcEndpoint] that hosts an <> (with driver ID and localhost hostname) for Spark local mode.
When <>, LocalEndpoint requests the <> to scheduler:TaskSchedulerImpl.md#statusUpdate[handle a task status update] (given the taskId, the task state and the data).
If the given scheduler:Task.md#TaskState[TaskState] is a finished state (one of FINISHED, FAILED, KILLED, LOST states), LocalEndpoint adds scheduler:TaskSchedulerImpl.md#CPUS_PER_TASK[spark.task.cpus] configuration (default: 1) to the <> registry followed by <>.
NOTE: StatusUpdate RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.
When <>, LocalEndpoint requests the single <> to executor:Executor.md#killTask[kill a task] (given the taskId, the interruptThread flag and the reason).
NOTE: KillTask RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.
When <>, LocalEndpoint requests the single <> to executor:Executor.md#stop[stop] and requests the given RpcCallContext to reply with true (as the response).
NOTE: StopExecutor RPC message is sent out exclusively when LocalSchedulerBackend is requested to <>.

== LocalSchedulerBackend
LocalSchedulerBackend is a SchedulerBackend and an ExecutorBackend for Spark local deployment.
[cols="1,2",options="header",width="100%"]
|===
| Master URL | Total CPU Cores

| local
| 1

| local[n]
| n

| local[*]
| The number of available CPU cores on the local machine

| local[n, m]
| n CPU cores and m task retries

| local[*, m]
| The number of available CPU cores on the local machine and m task retries
|===

=== Creating Instance
LocalSchedulerBackend takes the following to be created:
SparkConf
TaskSchedulerImpl
Total number of CPU cores
LocalSchedulerBackend is created when:
SparkContext is requested to create a Spark Scheduler (for local master URL)
KubernetesClusterManager (Spark on Kubernetes) is requested for a SchedulerBackend

=== Maximum Number of Concurrent Tasks

maxNumConcurrentTasks is part of the SchedulerBackend abstraction.
MemoryAllocator is an abstraction of memory allocators that TaskMemoryManager uses to allocate and release memory.
MemoryAllocator defines two concrete MemoryAllocators that are available under the names HEAP and UNSAFE.
A MemoryAllocator to use is selected when MemoryManager is created (based on MemoryMode).

=== Contract

=== Allocating Contiguous Block of Memory
```java
MemoryBlock allocate(
  long size)
```
Used when:
TaskMemoryManager is requested to allocate a memory page
MemoryConsumer is an abstraction of memory consumers (of TaskMemoryManager) that support spilling.
MemoryConsumers correspond to individual operators and data structures within a task. TaskMemoryManager receives memory allocation requests from MemoryConsumers and issues callbacks to consumers in order to trigger spilling when running low on memory.
A MemoryConsumer basically tracks how much memory is allocated.
MemoryManager is an abstraction of memory managers that can share available memory between tasks (TaskMemoryManager) and storage (BlockManager).
MemoryManager splits assigned memory into two regions:
Execution Memory for shuffles, joins, sorts and aggregations
Storage Memory for caching and propagating internal data across Spark nodes (in on- and off-heap modes)
MemoryManager is used to create BlockManager (and MemoryStore) and TaskMemoryManager.

=== Contract

=== Acquiring Execution Memory for Task
MemoryManager is available as SparkEnv.memoryManager on the driver and executors.
```scala
import org.apache.spark.SparkEnv
val mm = SparkEnv.get.memoryManager
```
```scala
// MemoryManager is private[spark]
// the following won't work unless within org.apache.spark package
// import org.apache.spark.memory.MemoryManager
// assert(mm.isInstanceOf[MemoryManager])

// we have to revert to string comparison
assert("UnifiedMemoryManager".equals(mm.getClass.getSimpleName))
```

=== Associating MemoryStore with Storage Memory Pools
```scala
setMemoryStore(
  store: MemoryStore): Unit
```
setMemoryStore requests the on-heap and off-heap storage memory pools to use the given MemoryStore.
MemoryManager creates an ExecutionMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapExecutionMemory.
MemoryManager creates a StorageMemoryPool for ON_HEAP memory mode when created and immediately requests it to incrementPoolSize to onHeapStorageMemory.
onHeapStorageMemoryPool is requested to setMemoryStore when MemoryManager is requested to setMemoryStore.
onHeapStorageMemoryPool is requested to release memory when MemoryManager is requested to release on-heap storage memory.
onHeapStorageMemoryPool is requested to release all memory when MemoryManager is requested to release all storage memory.
onHeapStorageMemoryPool is used when:
MemoryManager is requested for the storageMemoryUsed and onHeapStorageMemoryUsed
UnifiedMemoryManager is requested to acquire on-heap execution and storage memory
MemoryManager creates a StorageMemoryPool for OFF_HEAP memory mode when created and immediately requests it to incrementPoolSize to offHeapStorageMemory.
MemoryManager requests the MemoryPools to use a given MemoryStore when requested to setMemoryStore.
MemoryManager requests the MemoryPools to release memory when requested to releaseStorageMemory.
MemoryManager requests the MemoryPools to release all memory when requested to release all storage memory.
MemoryManager requests the MemoryPools for the memoryUsed when requested for storageMemoryUsed.
offHeapStorageMemoryPool is used when:
MemoryManager is requested for the offHeapStorageMemoryUsed
UnifiedMemoryManager is requested to acquire off-heap execution and storage memory

=== Total Storage Memory Used
```scala
storageMemoryUsed: Long
```
storageMemoryUsed is the sum of the memory used of the on-heap and off-heap storage memory pools.
tungstenMemoryMode tracks whether Tungsten memory will be allocated on the JVM heap or off-heap (using sun.misc.Unsafe).
tungstenMemoryMode is a final val and is therefore initialized once, when MemoryManager is created.
tungstenMemoryMode is OFF_HEAP when the following are all met:
spark.memory.offHeap.enabled configuration property is enabled
spark.memory.offHeap.size configuration property is greater than 0
JVM supports unaligned memory access (aka unaligned Unsafe, i.e. sun.misc.Unsafe package is available and the underlying system has unaligned-access capability)
Otherwise, tungstenMemoryMode is ON_HEAP.
Note
Given that spark.memory.offHeap.enabled configuration property is turned off by default and spark.memory.offHeap.size configuration property is 0 by default, Apache Spark seems to encourage using Tungsten memory allocated on the JVM heap (ON_HEAP).
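For completeness, switching Tungsten memory to OFF_HEAP would look roughly like the sketch below (the 2g size is an arbitrary assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Both properties are required for OFF_HEAP; the JVM must also support
// unaligned memory access.
val conf = new SparkConf()
  .setAppName("off-heap-demo")                 // hypothetical app name
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")      // must be greater than 0
val sc = new SparkContext(conf)
```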
tungstenMemoryMode is used when:
MemoryManager is created (and initializes the pageSizeBytes and tungstenMemoryAllocator internal properties)
defaultPageSizeBytes is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

=== Contract

=== Size of Memory Used
```scala
memoryUsed: Long
```
Used when:
MemoryPool is requested for the amount of free memory and decrementPoolSize
```java
long acquireExecutionMemory(
  long required,
  MemoryConsumer consumer)
```
acquireExecutionMemory allocates up to required execution memory (bytes) for the MemoryConsumer (from the MemoryManager).
When not enough memory could be allocated initially, acquireExecutionMemory requests every consumer (with the same MemoryMode, itself including) to spill.
acquireExecutionMemory returns the amount of memory allocated.
acquireExecutionMemory is used when:
MemoryConsumer is requested to acquire execution memory
TaskMemoryManager is requested to allocate a page
acquireExecutionMemory requests the MemoryManager to acquire execution memory (with required bytes, the taskAttemptId and the MemoryMode of the MemoryConsumer).
In the end, acquireExecutionMemory registers the MemoryConsumer (and adds it to the consumers registry) and prints out the following DEBUG message to the logs:
Task [taskAttemptId] acquired [got] for [consumer]
In case the MemoryManager offered less memory than required, acquireExecutionMemory finds the MemoryConsumers (in the consumers registry) with the same MemoryMode and non-zero memory used, sorts them by memory usage, and requests them (one by one) to spill until enough memory is acquired or there are no more consumers to release memory from (by spilling).
When a MemoryConsumer releases memory, acquireExecutionMemory prints out the following DEBUG message to the logs:
Task [taskAttemptId] released [released] from [c] for [consumer]
In case there is still not enough memory (less than required), acquireExecutionMemory requests the MemoryConsumer (to acquire memory for) to spill.
acquireExecutionMemory prints out the following DEBUG message to the logs:
Task [taskAttemptId] released [released] from itself ([consumer])
showMemoryUsage prints out the following INFO message to the logs (with the taskAttemptId):
Memory used in task [taskAttemptId]
showMemoryUsage requests every MemoryConsumer to report memory used. For consumers with non-zero memory usage, showMemoryUsage prints out the following INFO message to the logs:
Acquired by [consumer]: [memUsage]
showMemoryUsage requests the MemoryManager to getExecutionMemoryUsageForTask to calculate memory not accounted for (that is not associated with a specific consumer).
showMemoryUsage prints out the following INFO messages to the logs:
[memoryNotAccountedFor] bytes of memory were used by task [taskAttemptId] but are not associated with specific consumers
showMemoryUsage requests the MemoryManager for the executionMemoryUsed and storageMemoryUsed and prints out the following INFO message to the logs:
[executionMemoryUsed] bytes of memory are used for execution and [storageMemoryUsed] bytes of memory are used for storage
showMemoryUsage is used when:
MemoryConsumer is requested to throw an OutOfMemoryError

=== Cleaning Up All Allocated Memory
```java
long cleanUpAllAllocatedMemory()
```
The consumers collection is then cleared.
cleanUpAllAllocatedMemory finds all the registered MemoryConsumers (in the consumers registry) that still keep some memory used and, for every such consumer, prints out the following DEBUG message to the logs:
unreleased [getUsed] memory from [consumer]
cleanUpAllAllocatedMemory removes all the consumers.
For every MemoryBlock in the pageTable, cleanUpAllAllocatedMemory prints out the following DEBUG message to the logs:
unreleased page: [page] in task [taskAttemptId]
cleanUpAllAllocatedMemory marks the pages to be freed (FREED_IN_TMM_PAGE_NUMBER) and requests the MemoryManager for the tungstenMemoryAllocator to free up the MemoryBlock.
cleanUpAllAllocatedMemory clears the pageTable registry (by assigning null values).
cleanUpAllAllocatedMemory requests the MemoryManager to release execution memory that is not used by any consumer (with the acquiredButNotUsed and the tungstenMemoryMode).
In the end, cleanUpAllAllocatedMemory requests the MemoryManager to release all execution memory for the task.
cleanUpAllAllocatedMemory is used when:
TaskRunner is requested to run a task (and the task has finished successfully)
```java
MemoryBlock allocatePage(
  long size,
  MemoryConsumer consumer)
```
allocatePage allocates a block of memory (page) that is:
Below MAXIMUM_PAGE_SIZE_BYTES maximum size
For MemoryConsumers with the same MemoryMode as the TaskMemoryManager
allocatePage acquires execution memory (for the size and the MemoryConsumer). allocatePage returns immediately (with null) when this allocation ended up with zero or fewer bytes.
allocatePage takes the first clear bit in the allocatedPages bitmap as the page number (unless the whole page table is taken, in which case allocatePage throws an IllegalStateException).
allocatePage requests the MemoryManager for the tungstenMemoryAllocator that is requested to allocate the acquired memory.
allocatePage registers the page in the pageTable.
In the end, allocatePage prints out the following TRACE message to the logs and returns the MemoryBlock allocated.
Allocate page number [pageNumber] ([acquired] bytes)
Requesting the tungstenMemoryAllocator to allocate the acquired memory may throw an OutOfMemoryError. If so, allocatePage prints out the following WARN message to the logs:
Failed to allocate a page ([acquired] bytes), try again.
allocatePage adds the acquired memory to the acquiredButNotUsed and removes the page from the allocatedPages (by clearing the bit).
In the end, allocatePage tries to allocate the page again (recursively).
UnifiedMemoryManager is a MemoryManager (with the onHeapExecutionMemory being the Maximum Heap Memory with the onHeapStorageRegionSize taken out).
UnifiedMemoryManager allows for soft boundaries between storage and execution memory (allowing requests for memory in one region to be fulfilled by borrowing memory from the other).
apply creates a UnifiedMemoryManager with the Maximum Heap Memory and the size of the on-heap storage region as spark.memory.storageFraction of the Maximum Memory.
apply is used when:
SparkEnv utility is used to create a base SparkEnv (for the driver and executors)
UnifiedMemoryManager is given the maximum heap memory to use (for execution and storage) when created (that uses apply factory method which uses getMaxMemory).
UnifiedMemoryManager makes sure that the driver's system memory is at least 1.5 times the Reserved System Memory. Otherwise, getMaxMemory throws an IllegalArgumentException:
```text
System memory [systemMemory] must be at least [minSystemMemory].
Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
```
UnifiedMemoryManager makes sure that the executor memory (spark.executor.memory) is at least the Reserved System Memory. Otherwise, getMaxMemory throws an IllegalArgumentException:
```text
Executor memory [executorMemory] must be at least [minSystemMemory].
Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
```
UnifiedMemoryManager considers \"usable\" memory to be the system memory without the reserved memory.
UnifiedMemoryManager uses the fraction (based on spark.memory.fraction configuration property) of the \"usable\" memory for the maximum heap memory.
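As a rough, back-of-the-envelope sketch of that arithmetic (the 4g heap is an assumption; the 300 MB reserved memory and the 0.6 / 0.5 fractions are the defaults):

```scala
// Illustrative only: mirrors the getMaxMemory / apply arithmetic with assumed values.
val systemMemory    = 4L * 1024 * 1024 * 1024   // assume a 4g JVM heap
val reservedMemory  = 300L * 1024 * 1024        // reserved system memory (300 MB)
val memoryFraction  = 0.6                       // default spark.memory.fraction
val storageFraction = 0.5                       // default spark.memory.storageFraction

val usableMemory  = systemMemory - reservedMemory
val maxMemory     = (usableMemory * memoryFraction).toLong   // execution + storage
val storageRegion = (maxMemory * storageFraction).toLong     // on-heap storage region

println(s"maxMemory=$maxMemory storageRegion=$storageRegion")
```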
In the end, acquireExecutionMemory requests the ExecutionMemoryPool to acquire memory of numBytes bytes (with the maybeGrowExecutionPool and the maximum size of execution pool functions).
acquireExecutionMemory is part of the MemoryManager abstraction.

=== Maximum Size of Execution Pool
```scala
computeMaxExecutionPoolSize(): Long
```
computeMaxExecutionPoolSize takes the minimum size of the storage memory regions (based on the memory mode, ON_HEAP or OFF_HEAP, respectively):
Memory used of the on-heap or the off-heap storage memory pool
On-heap or the off-heap storage memory size
In the end, computeMaxExecutionPoolSize returns the size of the remaining memory space of the maximum memory (the maxHeapMemory or the maxOffHeapMemory for ON_HEAP or OFF_HEAP memory mode, respectively) without (the minimum size of) the storage memory region.
Spark Metrics gives you execution metrics of Spark subsystems (metrics instances, e.g. the driver of a Spark application or the master of a Spark Standalone cluster).
Spark Metrics uses Dropwizard Metrics Java library for the metrics infrastructure.
Metrics is a Java library which gives you unparalleled insight into what your code does in production.
Metrics provides a powerful toolkit of ways to measure the behavior of critical components in your production environment.
MetricsConfig is the configuration of the MetricsSystem (i.e. metrics spark-metrics-Source.md[sources] and spark-metrics-Sink.md[sinks]).
metrics.properties is the default metrics configuration file. It is configured using spark-metrics-properties.md#spark.metrics.conf[spark.metrics.conf] configuration property. The file is first loaded from the path directly before using Spark's CLASSPATH.
MetricsConfig also accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.
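For example, a console sink could be configured entirely through such properties (a sketch; the sink name, class and period are illustrative choices):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure a metrics sink for all instances (*) without a metrics.properties file.
val conf = new SparkConf()
  .setAppName("metrics-config-demo")   // hypothetical app name
  .set("spark.metrics.conf.*.sink.console.class", "org.apache.spark.metrics.sink.ConsoleSink")
  .set("spark.metrics.conf.*.sink.console.period", "10")
  .set("spark.metrics.conf.*.sink.console.unit", "seconds")
val sc = new SparkContext(conf)
```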
Spark comes with conf/metrics.properties.template file that is a template of metrics configuration.
Among the metrics sinks is spark-metrics-MetricsServlet.md[MetricsServlet] that is used when sink.servlet metrics sink is configured in spark-metrics-MetricsConfig.md[metrics configuration].
CAUTION: FIXME Describe configuration files and properties
NOTE: You can access a Spark subsystem's MetricsSystem using its corresponding \"leading\" port, e.g. 4040 for the driver, 8080 for Spark Standalone's master and applications.
NOTE: You have to use the trailing slash (/) to have the output.
MetricsConfig is the configuration of the MetricsSystem (i.e. metrics sources and sinks).
MetricsConfig is <> when MetricsSystem is.
MetricsConfig uses metrics.properties as the default metrics configuration file. It is configured using spark-metrics-properties.md#spark.metrics.conf[spark.metrics.conf] configuration property. The file is first loaded from the path directly before using Spark's CLASSPATH.
MetricsConfig accepts a metrics configuration using spark.metrics.conf.-prefixed configuration properties.
Spark comes with conf/metrics.properties.template file that is a template of metrics configuration.
subProperties takes the prop properties and destructures their keys using the given regex: the matching prefix (of a key) becomes a new key, with the value(s) being the matching suffix(es).
NOTE: subProperties is used when MetricsConfig <> (to apply the default metrics configuration) and when MetricsSystem registers metrics sources and sinks.
MetricsServlet is a metrics sink that gives metrics snapshots in JSON format.
MetricsServlet is a \"special\" sink as it is only available to the metrics instances with a web UI:
Driver of a Spark application
Spark Standalone's Master and Worker
You can access the metrics from MetricsServlet at /metrics/json URI by default. The entire URL depends on a metrics instance, e.g. http://localhost:4040/metrics/json/ for a running Spark application.
MetricsServlet is <> exclusively when MetricsSystem is started (and requested to register metrics sinks).
MetricsServlet can be configured using configuration properties with sink.servlet prefix (in spark-metrics-MetricsConfig.md[metrics configuration]). That is not required since MetricsConfig spark-metrics-MetricsConfig.md#setDefaultProperties[makes sure] that MetricsServlet is always configured.
MetricsServlet uses https://fasterxml.github.io/jackson-databind/[jackson-databind], the general data-binding package for Jackson (as <>) with Dropwizard Metrics library (i.e. registering a Coda Hale MetricsModule).
| path | /metrics/json/ | [[path]] Path URI prefix to bind to
| sample | false | [[sample]] Whether to show entire set of samples for histograms |===
[[internal-registries]] .MetricsServlet's Internal Properties (e.g. Registries, Counters and Flags)
[cols="1,2",options="header",width="100%"]
|===
| Name | Description
| mapper | [[mapper]] Jackson's https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html[com.fasterxml.jackson.databind.ObjectMapper] that "provides functionality for reading and writing JSON, either to and from basic POJOs (Plain Old Java Objects), or to and from a general-purpose JSON Tree Model (JsonNode), as well as related functionality for performing conversions."
When created, mapper is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule.
Used exclusively when MetricsServlet is requested to <>.
| servletPath | [[servletPath]] Value of <> configuration property
| servletShowSample | [[servletShowSample]] Flag to control whether to show samples (true) or not (false).
servletShowSample is the value of <> configuration property (if defined) or false.
Used when <> is requested to register a Coda Hale com.codahale.metrics.json.MetricsModule.
|===

=== Creating Instance
getMetricsSnapshot simply requests the <> to serialize the <> to a JSON string (using ++https://fasterxml.github.io/jackson-databind/javadoc/2.6/com/fasterxml/jackson/databind/ObjectMapper.html#writeValueAsString-java.lang.Object-++[ObjectMapper.writeValueAsString]).
NOTE: getMetricsSnapshot is used exclusively when MetricsServlet is requested to <>.
MetricsSystem creates a PrometheusServlet when requested to registerSinks for an instance with sink.prometheusServlet configuration.
MetricsSystem requests the PrometheusServlet for URL handlers when requested for servlet handlers (so it can be attached to a web UI and serve HTTP requests).
MetricsServlet JSON metrics sink that is only available for the <> with a web UI (i.e. the driver of a Spark application and Spark Standalone's Master).
MetricsSystem may have at most one MetricsServlet JSON metrics sink (which is registered by default).
Initialized when MetricsSystem registers <> (and finds a configuration entry with servlet sink name).
Used when MetricsSystem is requested for a <>.

=== Creating MetricsSystem
buildRegistryName uses spark-metrics-properties.md#spark.metrics.namespace[spark.metrics.namespace] and executor:Executor.md#spark.executor.id[spark.executor.id] Spark properties to differentiate between a Spark application's driver and executors, and the other Spark framework's components.
(only when <> is driver or executor) buildRegistryName builds metrics source name that is made up of spark-metrics-properties.md#spark.metrics.namespace[spark.metrics.namespace], executor:Executor.md#spark.executor.id[spark.executor.id] and the name of the source.
FIXME Finish for the other components.
buildRegistryName is used when MetricsSystem is requested to register or remove a metrics source.

=== Registering Metrics Sources for Spark Instance
```scala
registerSources(): Unit
```
registerSources finds <> configuration for the <>.
NOTE: instance is defined when MetricsSystem <>.
registerSources finds the configuration of all the spark-metrics-Source.md[metrics sources] for the subsystem (as described with source. prefix).
For every metrics source, registerSources finds class property, creates an instance, and in the end <>.
When registerSources fails, you should see the following ERROR message in the logs followed by the exception.
Source class [classPath] cannot be instantiated
registerSources is used when MetricsSystem is requested to start.
registerSinks requests the <> for the spark-metrics-MetricsConfig.md#getInstance[configuration] of the <>.
registerSinks requests the <> for the spark-metrics-MetricsConfig.md#subProperties[configuration] of all metrics sinks (i.e. configuration entries that match the ^sink\.(.+)\.(.+) regular expression).
For every metrics sink configuration, registerSinks takes the class property and (if defined) creates an instance of the metric sink using a constructor that takes the configuration, <> and <>.
For a single servlet metrics sink, registerSinks converts the sink to a spark-metrics-MetricsServlet.md[MetricsServlet] and sets the <> internal registry.
For all other metrics sinks, registerSinks adds the sink to the <> internal registry.
In case of an Exception, registerSinks prints out the following ERROR message to the logs:
Sink class [classPath] cannot be instantiated
registerSinks is used when MetricsSystem is requested to start.
Default: Spark Application ID (i.e. spark.app.id configuration property)
Since a Spark application's ID changes with every execution of a Spark application, a custom namespace can be specified for an easier metrics reporting.
Used when MetricsSystem is requested for a metrics source identifier (metrics namespace)
The given module is shuffle most of the time except:
rpc for NettyRpcEnv
files for NettyRpcEnv
Only defined in NettyRpcEnv to be either driver or executor
fromSparkConf makes a copy of (clones) the given SparkConf.
fromSparkConf sets the following configuration properties (for the given module):
spark.[module].io.serverThreads
spark.[module].io.clientThreads
The values are taken using the following properties in the order and until one is found (with suffix being serverThreads or clientThreads, respectively):
spark.[role].[module].io.[suffix]
spark.[module].io.[suffix]
Unless found, fromSparkConf defaults to the default number of threads (based on the given numUsableCores and not more than 8).
In the end, fromSparkConf creates a TransportConf (for the given module and the updated SparkConf).
fromSparkConf is used when:
SparkEnv utility is used to create a SparkEnv (with the spark.shuffle.service.enabled configuration property enabled)
ExternalShuffleService is created
NettyBlockTransferService is requested to init
NettyRpcEnv is created and requested for a downloadClient
IndexShuffleBlockResolver is created
ShuffleBlockPusher is requested to initiateBlockPush
BlockManager is requested to readDiskBlockFromSameHostExecutor
While being created, TransportClientFactory requests the given TransportContext for the TransportConf that is used to access the values of the following (configuration) properties:
Plugin Framework is an API for registering custom extensions (plugins) to be executed on the driver and executors.
Plugin Framework uses separate PluginContainers for the driver and executors, and spark.plugins configuration property for SparkPlugins to be registered.
Plugin Framework was introduced in Spark 2.4.4 (with an API for executors) with further changes in Spark 3.0.0 (to cover the driver).
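Below is a hedged sketch of a custom SparkPlugin (the package and class names are made up); it would be registered by adding its fully-qualified class name to spark.plugins:

```scala
package myorg.plugins  // hypothetical package

import java.util.{Collections, Map => JMap}
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// A do-almost-nothing plugin that only logs where its components are initialized.
class NoopSparkPlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      println(s"Driver plugin initialized on ${ctx.hostname()}")
      Collections.emptyMap[String, String]()   // extra conf passed to the executor side
    }
  }

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit =
      println(s"Executor plugin initialized on executor ${ctx.executorID()}")
  }
}
```

With the class on the classpath, setting spark.plugins to myorg.plugins.NoopSparkPlugin would make the Plugin Framework create its driver and executor components.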
executorPlugins finds all the configuration properties with spark.plugins.internal.conf. prefix (in the SparkConf) for extra configuration of every ExecutorPlugin of the given SparkPlugins.
For every SparkPlugin (in the given SparkPlugins) that defines an ExecutorPlugin, executorPlugins creates a PluginContextImpl, requests the ExecutorPlugin to init (with the PluginContextImpl and the extra configuration) and the PluginContextImpl to registerMetrics.
In the end, executorPlugins prints out the following INFO message to the logs (for every ExecutorPlugin):
Initialized executor component for plugin [name].
PluginContainer is an abstraction of plugin containers that can register metrics (for the driver and executors).
PluginContainer is created for the driver and executors using apply utility.

=== Contract

=== Listening to Task Failures
apply creates a PluginContainer for the driver or executors (based on the type of the first input argument, i.e. SparkContext or SparkEnv, respectively).
apply first loads the SparkPlugins defined by spark.plugins configuration property.
Only when there was at least one plugin loaded, apply creates a DriverPluginContainer or ExecutorPluginContainer.
Resilient Distributed Dataset (aka RDD) is the primary data abstraction in Apache Spark and the core of Spark (that I often refer to as \"Spark Core\").
.The origins of RDD
The original paper that gave birth to the concept of RDD is https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing] by Matei Zaharia, et al.
An RDD is a description of a fault-tolerant and resilient computation over a distributed collection of records (spread over <>).
NOTE: One could compare RDDs to collections in Scala, i.e. a RDD is computed on many JVMs while a Scala collection lives on a single JVM.
Using RDDs, Spark hides data partitioning and distribution, which in turn allowed its designers to build a parallel computation framework with a higher-level programming interface (API) for four mainstream programming languages.
The features of RDDs (decomposing the name):
Resilient, i.e. fault-tolerant with the help of <> and so able to recompute missing or damaged partitions due to node failures.
Distributed with data residing on multiple nodes in a spark-cluster.md[cluster].
Dataset is a collection of spark-rdd-partitions.md[partitioned data] with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).
.RDDs
image::spark-rdds.png[align="center"]
From the scaladoc of http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD[org.apache.spark.rdd.RDD]:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
From the original paper about RDD - https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing]:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
Beside the above traits (that are directly embedded in the name of the data abstraction - RDD) it has the following additional traits:
In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as possible.
Immutable or Read-Only, i.e. it does not change once created and can only be transformed using transformations to new RDDs.
Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action is executed that triggers the execution.
Cacheable, i.e. you can hold all the data in a persistent \"storage\" like memory (default and the most preferred) or disk (the least preferred due to access speed).
Parallel, i.e. process data in parallel.
Typed -- RDD records have types, e.g. Long in RDD[Long] or (Int, String) in RDD[(Int, String)].
Partitioned -- records are partitioned (split into logical partitions) and distributed across nodes in a cluster.
Location-Stickiness -- RDD can define <> to compute partitions (as close to the records as possible).
NOTE: Preferred location (aka locality preferences or placement preferences or locality info) is information about the locations of RDD records (that Spark's scheduler:DAGScheduler.md#preferred-locations[DAGScheduler] uses to place computing partitions on to have the tasks as close to the data as possible).
Computing partitions of an RDD is a distributed process by design. To achieve even data distribution, as well as to leverage data locality (in distributed systems like HDFS or Apache Kafka in which data is partitioned by default), RDDs are split into a fixed number of spark-rdd-partitions.md[partitions] - logical chunks (parts) of data. The logical division is for processing only; internally the data is not divided whatsoever. Each partition comprises records.
spark-rdd-partitions.md[Partitions are the units of parallelism]. You can control the number of partitions of a RDD using spark-rdd-partitions.md#repartition[repartition] or spark-rdd-partitions.md#coalesce[coalesce] transformations. Spark tries to be as close to data as possible without wasting time to send data across network by means of RDD shuffling, and creates as many partitions as required to follow the storage layout and thus optimize data access. It leads to a one-to-one mapping between (physical) data in distributed data storage, e.g. HDFS or Cassandra, and partitions.
RDDs support two kinds of operations:
<> - lazy operations that return another RDD.
<> - operations that trigger computation and return values.
The motivation to create RDD were (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf[after the authors]) two types of applications that current computing frameworks handle inefficiently:
iterative algorithms in machine learning and graph computations.
interactive data mining tools as ad-hoc queries on the same dataset.
The goal is to reuse intermediate in-memory results across multiple data-intensive workloads with no need for copying large amounts of data over the network.
Technically, RDDs follow the <> defined by the five main intrinsic properties:
An array of spark-rdd-partitions.md[partitions] that a dataset is divided to.
A rdd:RDD.md#compute[compute] function to do a computation on partitions.
A list of dependencies on parent RDDs (the RDD lineage).
An optional rdd:Partitioner.md[Partitioner] that defines how keys are hashed, and the pairs partitioned (for key-value RDDs)
Optional <> (aka locality info), i.e. hosts for a partition where the records live or are the closest to read from.
This RDD abstraction supports an expressive set of operations without having to modify scheduler for each one.
[[context]] An RDD is a named (by name) and uniquely identified (by id) entity in a SparkContext.md[] (available as context property).
RDDs live in one and only one SparkContext.md[] that creates a logical boundary.
NOTE: RDDs cannot be shared between SparkContexts (see SparkContext.md#sparkcontext-and-rdd[SparkContext and RDDs]).
An RDD can optionally have a friendly name accessible using name that can be changed using =:
```text
scala> val ns = sc.parallelize(0 to 10)
ns: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> ns.id
res0: Int = 2

scala> ns.name
res1: String = null

scala> ns.name = "Friendly name"
ns.name: String = Friendly name

scala> ns.name
res2: String = Friendly name

scala> ns.toDebugString
res3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:24 []
```
RDDs are a container of instructions on how to materialize big (arrays of) distributed data, and how to split it into partitions so Spark (using executor:Executor.md[executors]) can hold some of them.
In general data distribution can help executing processing in parallel so a task processes a chunk of data that it could eventually keep in memory.
Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in parallel. Inside a partition, data is processed sequentially.
Saving partitions results in part-files instead of one single file (unless there is a single partition).
== [[transformations]] Transformations
A transformation is a lazy operation on a RDD that returns another RDD, e.g. map, flatMap, filter, reduceByKey, join, cogroup, etc.
Find out more in rdd:spark-rdd-transformations.md[Transformations].
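For instance (a spark-shell sketch with sc available):

```scala
// Transformations only build a new RDD lineage; nothing is computed yet.
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)           // map is a transformation
val evens   = doubled.filter(_ % 4 == 0)   // so is filter
// No job has run so far; an action (e.g. collect) is needed to materialize evens.
```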
== [[actions]] Actions
An action is an operation that triggers execution of <> and returns a value (to a Spark driver - the user program).
TIP: Go in-depth in the section spark-rdd-actions.md[Actions].
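A few actions in a spark-shell sketch (sc is assumed to be available):

```scala
// Each action below triggers a job and returns a value to the driver.
val numbers = sc.parallelize(1 to 10)
println(numbers.count())          // 10
println(numbers.reduce(_ + _))    // 55
numbers.take(3).foreach(println)  // 1, 2, 3
```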
== [[creating-rdds]] Creating RDDs
=== SparkContext.parallelize
One way to create a RDD is with SparkContext.parallelize method. It accepts a collection of elements as shown below (sc is a SparkContext instance):
```text
scala> val rdd = sc.parallelize(1 to 1000)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25
```
You may also want to randomize the sample data:
```text
scala> val data = Seq.fill(10)(util.Random.nextInt)
data: Seq[Int] = List(-964985204, 1662791, -1820544313, -383666422, -111039198, 310967683, 1114081267, 1244509086, 1797452433, 124035586)

scala> val rdd = sc.parallelize(data)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:29
```
Given that the reason to use Spark is to process more data than your own laptop could handle, SparkContext.parallelize is mainly used to learn Spark in the Spark shell. SparkContext.parallelize requires all the data to be available on a single machine - the Spark driver - which eventually hits the limits of your laptop.
=== SparkContext.makeRDD
CAUTION: FIXME What's the use case for makeRDD?
```text
scala> sc.makeRDD(0 to 1000)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:25
```
=== SparkContext.textFile
One of the easiest ways to create an RDD is to use SparkContext.textFile to read files.
You can use the local README.md file (and then flatMap over the lines inside to have an RDD of words):
```text
scala> val words = sc.textFile("README.md").flatMap(_.split("\\W+")).cache
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:24
```
NOTE: You spark-rdd-caching.md[cache] it so the computation is not performed every time you work with words.
== [[creating-rdds-from-input]] Creating RDDs from Input
Refer to spark-io.md[Using Input and Output (I/O)] to learn about the IO API to create RDDs.
=== Transformations
RDD transformations by definition transform an RDD into another RDD and hence are the way to create new ones.
Refer to <> section to learn more.
== RDDs in Web UI
It is quite informative to look at RDDs in the Web UI that is at http://localhost:4040 for spark-shell.md[Spark shell].
Execute the following Spark application (type all the lines in spark-shell):
```scala
val ints = sc.parallelize(1 to 100)  // <1>
ints.setName("Hundred ints")         // <2>
ints.cache                           // <3>
ints.count                           // <4>
```

<1> Creates an RDD with a hundred numbers (with as many partitions as possible)
<2> Sets the name of the RDD
<3> Caches the RDD for performance reasons, which also makes it visible in the Storage tab in the web UI
<4> Executes the action (and materializes the RDD)
With the above executed, you should see the following in the Web UI:
.RDD with custom name
image::spark-ui-rdd-name.png[align="center"]
Click the name of the RDD (under RDD Name) and you will get the details of how the RDD is cached.
.RDD Storage Info
image::spark-ui-storage-hundred-ints.png[align="center"]
Execute the following Spark job and you will see how the number of partitions decreases.
```scala
ints.repartition(2).count
```
.Number of tasks after repartition
image::spark-ui-repartition-2.png[align="center"]
Aggregator is a set of <> used to aggregate data using rdd:PairRDDFunctions.md#combineByKeyWithClassTag[PairRDDFunctions.combineByKeyWithClassTag] transformation.
Aggregator[K, V, C] is a parameterized type of K keys, V values, and C combiner (partial) values.
[[creating-instance]][[aggregation-functions]] Aggregator transforms an RDD[(K, V)] into an RDD[(K, C)] (for a \"combined type\" C) using the functions:
[[createCombiner]] createCombiner: V => C
[[mergeValue]] mergeValue: (C, V) => C
[[mergeCombiners]] mergeCombiners: (C, C) => C
Aggregator is used to create a ShuffleDependency and ExternalSorter.
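As an illustration of the three functions, the sketch below computes per-key averages with PairRDDFunctions.combineByKey, whose combiner C is a (sum, count) pair (the input data is made up):

```scala
// createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C
val scores = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0)))

val createCombiner = (v: Double) => (v, 1L)
val mergeValue     = (c: (Double, Long), v: Double) => (c._1 + v, c._2 + 1)
val mergeCombiners = (c1: (Double, Long), c2: (Double, Long)) => (c1._1 + c2._1, c1._2 + c2._2)

val avgByKey = scores
  .combineByKey(createCombiner, mergeValue, mergeCombiners)
  .mapValues { case (sum, count) => sum / count }

avgByKey.collect().foreach(println)   // (a,2.0), (b,3.0)
```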
combineValuesByKey creates a new shuffle:ExternalAppendOnlyMap.md[ExternalAppendOnlyMap] (with the <>).
combineValuesByKey requests the ExternalAppendOnlyMap to shuffle:ExternalAppendOnlyMap.md#insertAll[insert all key-value pairs] from the given iterator (that is the values of a partition).
combineValuesByKey <>.
In the end, combineValuesByKey requests the ExternalAppendOnlyMap for an shuffle:ExternalAppendOnlyMap.md#iterator[iterator of \"combined\" pairs].
combineValuesByKey is used when:
rdd:PairRDDFunctions.md#combineByKeyWithClassTag[PairRDDFunctions.combineByKeyWithClassTag] transformation is used (with the same Partitioner as the RDD's)
BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined records for a reduce task] (with the Map-Side Partial Aggregation Flag off)
combineCombinersByKey is used when BlockStoreShuffleReader is requested to shuffle:BlockStoreShuffleReader.md#read[read combined records for a reduce task] (with the Map-Side Partial Aggregation Flag on).
```text
scala> myRdd.dependencies.map(_.rdd).foreach(println)
MapPartitionsRDD[6] at groupBy at <console>:39
```
RDD.toDebugString is used to print out the RDD lineage in a developer-friendly way.
```text
scala> println(myRdd.toDebugString)
(16) ShuffledRDD[7] at groupBy at <console>:39 []
 +-(16) MapPartitionsRDD[6] at groupBy at <console>:39 []
 | ParallelCollectionRDD[5] at parallelize at <console>:39 []
```
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.HadoopRDD[HadoopRDD] is an RDD that provides core functionality for reading data stored in HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI using the older MapReduce API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/package-summary.html[org.apache.hadoop.mapred]).
HadoopRDD is created as a result of calling the following methods in SparkContext.md[]:
hadoopFile
textFile (the most often used in examples!)
sequenceFile
Partitions are of type HadoopPartition.
When a HadoopRDD is computed, i.e. an action is called, you should see the INFO message Input split: in the logs.

```text
scala> sc.textFile("README.md").count
...
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:0+1784
15/10/10 18:03:21 INFO HadoopRDD: Input split: file:/Users/jacek/dev/oss/spark/README.md:1784+1784
...
```
The following properties are set upon partition execution:
mapred.tip.id - task id of this task's attempt
mapred.task.id - task attempt's id
mapred.task.is.map as true
mapred.task.partition - split id
mapred.job.id
Spark settings for HadoopRDD:
spark.hadoop.cloneConf (default: false) - shouldCloneJobConf - should a Hadoop job configuration JobConf object be cloned before spawning a Hadoop job. Refer to https://issues.apache.org/jira/browse/SPARK-2546[[SPARK-2546] Configuration object thread safety issue]. When true, you should see a DEBUG message Cloning Hadoop Configuration.
You can register callbacks on TaskContext.
HadoopRDDs are not checkpointed. They do nothing when checkpoint() is called.
The number of partitions of a HadoopRDD, i.e. the return value of getPartitions, is calculated using InputFormat.getSplits(jobConf, minPartitions), where minPartitions is only a hint of how many partitions one may want at minimum. Being a hint, it does not mean the number of partitions will be exactly the number given.
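As a quick, hedged spark-shell check of the hint-only nature of minPartitions (the exact result depends on the file size and the input format's split computation):

```scala
// minPartitions is only a hint: the actual partition count comes from
// InputFormat.getSplits and may be equal to or greater than the hint.
val readme = sc.textFile("README.md", minPartitions = 4)
println(readme.getNumPartitions)  // typically 4 or more for a splittable file
```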
For SparkContext.textFile the input format class is https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html[org.apache.hadoop.mapred.TextInputFormat].
The https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html[javadoc of org.apache.hadoop.mapred.FileInputFormat] says:
FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.
TIP: You may find https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L319[the sources of org.apache.hadoop.mapred.FileInputFormat.getSplits] enlightening.
"},{"location":"rdd/HadoopRDD/#whats-hadoop-split-input-splits-for-hadoop-reads-see-inputformatgetsplits","title":"What's Hadoop Split? input splits for Hadoop reads? See InputFormat.getSplits","text":""},{"location":"rdd/HashPartitioner/","title":"HashPartitioner","text":"
HashPartitioner is a Partitioner for hash-based partitioning.
Important
HashPartitioner places null keys in the 0th partition.
HashPartitioner is used as the default Partitioner.
HashPartitioner takes the following to be created:
Number of partitions"},{"location":"rdd/HashPartitioner/#number-of-partitions","title":"Number of Partitions
```scala
numPartitions: Int
```
numPartitions returns the given number of partitions.
numPartitions is part of the Partitioner abstraction.
","text":""},{"location":"rdd/HashPartitioner/#partition-for-key","title":"Partition for Key
```scala
getPartition(
  key: Any): Int
```
For null keys getPartition simply returns 0.
For non-null keys, getPartition uses the Object.hashCode of the key modulo the number of partitions. For negative results, getPartition adds the number of partitions to make it non-negative.
getPartition is part of the Partitioner abstraction.
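The following is a minimal sketch of the logic described above (my own illustration, not Spark's source): null keys go to partition 0, and other keys use hashCode modulo the number of partitions, shifted into the non-negative range when the modulo is negative.

```scala
// A sketch of hash-based partitioning as described above (illustrative only).
def sketchGetPartition(key: Any, numPartitions: Int): Int = key match {
  case null => 0                               // null keys land in partition 0
  case k =>
    val mod = k.hashCode % numPartitions       // may be negative for negative hash codes
    if (mod < 0) mod + numPartitions else mod  // shift into [0, numPartitions)
}

sketchGetPartition(null, 4)     // 0
sketchGetPartition("spark", 4)  // some partition in [0, 4)
sketchGetPartition(-7, 4)       // 1 (non-negative result)
```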
compute merely throws a SparkException (that explains the reason):

```text
Checkpoint block [RDDBlockId] not found! Either the executor
that originally checkpointed this partition is no longer alive, or the original RDD is
unpersisted. If this problem persists, you may consider using `rdd.checkpoint()`
instead, which is slower than local checkpointing but more fault-tolerant.
```
doCheckpoint is part of the RDDCheckpointData abstraction.
doCheckpoint creates a LocalCheckpointRDD with the RDD. doCheckpoint triggers caching any missing partitions (by checking availability of the RDDBlockIds for the partitions in the BlockManagerMaster).
Extra Spark Job
If there are any missing partitions (RDDBlockIds) doCheckpoint requests the SparkContext to run a Spark job with the RDD and the missing partitions.
doCheckpoint makes sure that the StorageLevel of the RDD uses disk (possibly among other storage options). If not, doCheckpoint throws an AssertionError:

```text
Storage level [level] is not appropriate for local checkpointing
```
isBarrier_ is enabled (true) when either this MapPartitionsRDD is isFromBarrier or any of the parent RDDs is isBarrier. Otherwise, isBarrier_ is disabled (false).
NarrowDependency[T] is an extension of the Dependency abstraction for narrow dependencies (of RDD[T]s) where each partition of the child RDD depends on a small number of partitions of the parent RDD.
PruneDependency is a NarrowDependency that represents a dependency between the PartitionPruningRDD and the parent RDD (with a subset of partitions of the parents).
NewHadoopRDD initializes the <>.

== OrderedRDDFunctions
```scala
class OrderedRDDFunctions[
  K: Ordering : ClassTag,
  V: ClassTag,
  P <: Product2[K, V] : ClassTag]
```
OrderedRDDFunctions adds extra operators to RDDs of (key, value) pairs (RDD[(K, V)]) where the K key is sortable (i.e. any key type K that has an implicit Ordering[K] in scope).
Tip
Learn more about Ordering in the Scala Standard Library documentation.
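For example (a small spark-shell sketch with made-up data), String keys have an implicit Ordering[String] in scope, so sortByKey from OrderedRDDFunctions becomes available on the pair RDD:

```scala
// String keys bring an implicit Ordering[String], so OrderedRDDFunctions kicks in.
val pairs = sc.parallelize(Seq(("banana", 2), ("apple", 5), ("cherry", 1)))

pairs.sortByKey().collect                   // ascending by key
pairs.sortByKey(ascending = false).collect  // descending by key
```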
Uses a HashPartitioner (with the given numPartitions)
combineByKeyWithClassTag creates an Aggregator for the given aggregation functions.
combineByKeyWithClassTag branches off per the given Partitioner.
If the input partitioner and the RDD's are the same, combineByKeyWithClassTag simply mapPartitions on the RDD with the following arguments:
Iterator of the Aggregator
preservesPartitioning flag turned on
If the input partitioner is different than the RDD's, combineByKeyWithClassTag creates a ShuffledRDD (with the Serializer, the Aggregator, and the mapSideCombine flag).
saveAsNewAPIHadoopDataset creates a new HadoopMapReduceWriteConfigUtil (with the given Configuration) and writes the RDD out.
Configuration should have all the relevant output params set (an output format, output paths, e.g. a table name to write to) in the same way as it would be configured for a Hadoop MapReduce job.
"},{"location":"rdd/RDD/","title":"RDD \u2014 Description of Distributed Computation","text":"
RDD[T] is an abstraction of fault-tolerant resilient distributed datasets that are mere descriptions of computations over a distributed collection of records (of type T).
DAGScheduler is requested to submitMissingTasks (that are either ShuffleMapStages to create ShuffleMapTasks or ResultStage to create ResultTasks)
RDDInfo is created
ShuffleDependency is requested to canShuffleMergeBeEnabled
DAGScheduler is requested to checkBarrierStageWithRDDChainPattern, checkBarrierStageWithDynamicAllocation, checkBarrierStageWithNumSlots, handleTaskCompletion (FetchFailed case to mark a map stage as broken)
doCheckpoint turns the doCheckpointCalled flag on (to prevent multiple executions).
doCheckpoint branches off based on whether a RDDCheckpointData is defined or not:
With the RDDCheckpointData defined, doCheckpoint checks out the checkpointAllMarkedAncestors flag and if enabled, doCheckpoint requests the Dependencies for the RDD that are in turn requested to doCheckpoint themselves. Otherwise, doCheckpoint requests the RDDCheckpointData to checkpoint.
With the RDDCheckpointData undefined, doCheckpoint requests the Dependencies for the RDD that are in turn requested to doCheckpoint themselves.
In other words, with the RDDCheckpointData defined, requesting doCheckpoint of the Dependencies is guarded by the checkpointAllMarkedAncestors flag.
doCheckpoint skips execution if called earlier.
doCheckpoint is used when:
SparkContext is requested to run a job synchronously
```scala
val wordCount = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

scala> println(wordCount.toDebugString)
(2) ShuffledRDD[21] at reduceByKey at <console>:24 []
 +-(2) MapPartitionsRDD[20] at map at <console>:24 []
 | MapPartitionsRDD[19] at flatMap at <console>:24 []
 | README.md MapPartitionsRDD[18] at textFile at <console>:24 []
 | README.md HadoopRDD[17] at textFile at <console>:24 []
```
toDebugString uses indentations to indicate a shuffle boundary.
The numbers in round brackets show the level of parallelism at each stage, e.g. (2) in the above output.
```text
scala> println(wordCount.getNumPartitions)
2
```
With spark.logLineage enabled, toDebugString is printed out when executing an action.
```text
$ ./bin/spark-shell --conf spark.logLineage=true

scala> sc.textFile("README.md", 4).count
...
15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
(4) MapPartitionsRDD[1] at textFile at <console>:25 []
 | README.md HadoopRDD[0] at textFile at <console>:25 []
```
RangePartitioner takes the following to be created:
Hint for the number of partitions
Key-Value RDD (RDD[_ <: Product2[K, V]])
ascending flag (default: true)
samplePointsPerPartitionHint (default: 20)"},{"location":"rdd/RangePartitioner/#number-of-partitions","title":"Number of Partitions
```scala
numPartitions: Int
```
numPartitions is part of the Partitioner abstraction.
numPartitions is 1 more than the length of the range bounds (since the number of range bounds is 0 for 0 or 1 partitions).
","text":""},{"location":"rdd/RangePartitioner/#partition-for-key","title":"Partition for Key
```scala
getPartition(
  key: Any): Int
```
getPartition is part of the Partitioner abstraction.
getPartition branches off based on the length of the range bounds.
For up to 128 range bounds, getPartition does a linear scan: starting from candidate partition 0 and walking over the rangeBounds, it increments the candidate partition as long as the key is greater than the current range bound. The scan stops at the first range bound that is not smaller than the key (or when there are no more range bounds), and the candidate partition at that point is the result.
For more than 128 range bounds, getPartition...FIXME

In the end, getPartition returns the candidate partition number when ascending is enabled, or flips it (to the number of the rangeBounds minus the candidate partition number) otherwise.
For the number of partitions up to and including 1, rangeBounds is an empty array.
For more than 1 partition, rangeBounds determines the sample size per partition. The total sample size is the samplePointsPerPartitionHint multiplied by the number of partitions, capped at 1e6. rangeBounds allows for a 3x over-sample per partition.
rangeBounds sketches the keys of the input rdd (with the sampleSizePerPartition).
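An illustrative back-of-the-envelope version of the sample-size arithmetic described above (the variable names and numbers are mine, not Spark's):

```scala
// Hypothetical inputs for illustration only.
val samplePointsPerPartitionHint = 20
val numPartitions = 200     // target number of partitions for the RangePartitioner
val rddPartitions = 50      // number of partitions of the input RDD

// Total sample size, capped at 1e6.
val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * numPartitions, 1e6)
// 3x over-sample per input partition.
val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rddPartitions).toInt
// sampleSize = 4000.0, sampleSizePerPartition = 240
```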
```scala
writePartitionerToCheckpointDir(
  sc: SparkContext,
  partitioner: Partitioner,
  checkpointDirPath: Path): Unit
```
writePartitionerToCheckpointDir creates the <> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.
writePartitionerToCheckpointDir requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].
writePartitionerToCheckpointDir requests the SerializerInstance to serializer:SerializerInstance.md#serializeStream[serialize the output stream] and serializer:SerializationStream.md#writeObject[writes] the given Partitioner.
In the end, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:
"},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#written-partitioner-to-partitionerfilepath","title":"Written partitioner to [partitionerFilePath]","text":"
In case of any non-fatal exception, writePartitionerToCheckpointDir prints out the following DEBUG message to the logs:
"},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#error-writing-partitioner-partitioner-to-checkpointdirpath","title":"Error writing partitioner [partitioner] to [checkpointDirPath]","text":"
writePartitionerToCheckpointDir is used when ReliableCheckpointRDD is requested to <>.
== [[readCheckpointedPartitionerFile]] Reading Partitioner from Checkpointed Directory
readCheckpointedPartitionerFile opens the <> with the buffer size based on configuration-properties.md#spark.buffer.size[spark.buffer.size] configuration property.
readCheckpointedPartitionerFile requests the core:SparkEnv.md#serializer[default Serializer] for a new serializer:Serializer.md#newInstance[SerializerInstance].
readCheckpointedPartitionerFile requests the SerializerInstance to serializer:SerializerInstance.md#deserializeStream[deserialize the input stream] and serializer:DeserializationStream.md#readObject[read the Partitioner] from the partitioner file.
readCheckpointedPartitionerFile prints out the following DEBUG message to the logs and returns the partitioner.
"},{"location":"rdd/ReliableCheckpointRDD/#sourceplaintext_2","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableCheckpointRDD/#read-partitioner-from-partitionerfilepath","title":"Read partitioner from [partitionerFilePath]","text":"
In case of FileNotFoundException or any non-fatal exceptions, readCheckpointedPartitionerFile prints out a corresponding message to the logs and returns None.
readCheckpointedPartitionerFile is used when ReliableCheckpointRDD is requested for the <>.
== [[logging]] Logging
Enable ALL logging level for org.apache.spark.rdd.ReliableCheckpointRDD$ logger to see what happens inside.
ReliableRDDCheckpointData creates a subdirectory of the SparkContext.md#checkpointDir[application-wide checkpoint directory] for <> the given <>.
The name of the subdirectory uses the rdd:RDD.md#id[unique identifier] of the <>:

```text
rdd-[id]
```
doCheckpoint rdd:ReliableCheckpointRDD.md#writeRDDToCheckpointDirectory[writes] the <> to the <> (that creates a new RDD).
With configuration-properties.md#spark.cleaner.referenceTracking.cleanCheckpoints[spark.cleaner.referenceTracking.cleanCheckpoints] configuration property enabled, doCheckpoint requests the SparkContext.md#cleaner[ContextCleaner] to core:ContextCleaner.md#registerRDDCheckpointDataForCleanup[registerRDDCheckpointDataForCleanup] for the new RDD.
In the end, doCheckpoint prints out the following INFO message to the logs and returns the new RDD.
"},{"location":"rdd/ReliableRDDCheckpointData/#sourceplaintext_1","title":"[source,plaintext]","text":""},{"location":"rdd/ReliableRDDCheckpointData/#done-checkpointing-rdd-id-to-cpdir-new-parent-is-rdd-id","title":"Done checkpointing RDD [id] to [cpDir], new parent is RDD [id]","text":"
doCheckpoint is part of the rdd:RDDCheckpointData.md#doCheckpoint[RDDCheckpointData] abstraction.
ShuffleDependency is a Dependency on the output of a ShuffleMapStage of a key-value RDD.
ShuffleDependency uses the RDD to know the number of (map-side/pre-shuffle) partitions and the Partitioner for the number of (reduce-side/post-shuffle) partitions.
ShuffleDependency takes the following to be created:
RDD (RDD[_ <: Product2[K, V]])
Partitioner
Serializer (default: SparkEnv.get.serializer)
Optional Key Ordering (default: undefined)
Optional Aggregator
mapSideCombine
ShuffleWriteProcessor
ShuffleDependency is created\u00a0when:
CoGroupedRDD is requested for the dependencies (for RDDs with different partitioners)
ShuffledRDD is requested for the dependencies
SubtractedRDD is requested for the dependencies (for an RDD with different partitioner)
ShuffleExchangeExec (Spark SQL) physical operator is requested to prepare a ShuffleDependency
When created, ShuffleDependency gets the shuffle id.
ShuffleDependency registers itself with the ShuffleManager and gets a ShuffleHandle (available as shuffleHandle). ShuffleDependency uses SparkEnv to access the ShuffleManager.
In the end, ShuffleDependency registers itself with the ContextCleaner (if configured) and the ShuffleDriverComponents.
ShuffleDependency registers itself with the ShuffleManager when created.
The ShuffleHandle is used when:
CoGroupedRDDs, ShuffledRDD, SubtractedRDD, and ShuffledRowRDD (Spark SQL) are requested to compute a partition (to get a ShuffleReader for a ShuffleDependency)
ShuffleMapTask is requested to run (to get a ShuffleWriter for a ShuffleDependency).
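A quick spark-shell sketch (with made-up data) to see the ShuffleDependency and its shuffle id behind a shuffled RDD:

```scala
// groupByKey produces a ShuffledRDD whose single dependency is a ShuffleDependency.
val grouped = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c"))).groupByKey()

grouped.dependencies.foreach(println)
// org.apache.spark.ShuffleDependency@...

val dep = grouped.dependencies.head.asInstanceOf[org.apache.spark.ShuffleDependency[_, _, _]]
println(dep.shuffleId)  // the shuffle id assigned at creation time
```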
ShuffledRDD is an RDD of key-value pairs that represents a shuffle step in a RDD lineage (and indicates start of a new stage).
When requested to compute a partition, ShuffledRDD uses the one and only ShuffleDependency for a ShuffleHandle for a ShuffleReader (from the system ShuffleManager) that is used to read the (combined) key-value pairs.
getDependencies uses the user-specified Serializer, if defined, or requests the current SerializerManager for one.
getDependencies uses the mapSideCombine internal flag for the types of the keys and values (i.e. K and C or K and V when the flag is enabled or not, respectively).
In the end, getDependencies creates a single ShuffleDependency (with the previous RDD, the Partitioner, and the Serializer).
setMapSideCombine is used for PairRDDFunctions.combineByKeyWithClassTag transformation (which defaults to the flag enabled).
","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#placement-preferences-of-partition","title":"Placement Preferences of Partition Signature
userSpecifiedSerializer is undefined (None) by default and can be changed using setSerializer method (that is used for PairRDDFunctions.combineByKeyWithClassTag transformation).
","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#demos","title":"Demos","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#shuffledrdd-and-coalesce","title":"ShuffledRDD and coalesce
```scala
val data = sc.parallelize(0 to 9)
val coalesced = data.coalesce(numPartitions = 4, shuffle = true)
scala> println(coalesced.toDebugString)
(4) MapPartitionsRDD[9] at coalesce at <pastie>:75 []
 | CoalescedRDD[8] at coalesce at <pastie>:75 []
 | ShuffledRDD[7] at coalesce at <pastie>:75 []
 +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []
 | ParallelCollectionRDD[5] at parallelize at <pastie>:74 []
```
","text":"","tags":["DeveloperApi"]},{"location":"rdd/ShuffledRDD/#shuffledrdd-and-sortbykey","title":"ShuffledRDD and sortByKey
```scala
val data = sc.parallelize(0 to 9)
val grouped = data.groupBy(_ % 2)
val sorted = grouped.sortByKey(numPartitions = 2)
scala> println(sorted.toDebugString)
(2) ShuffledRDD[15] at sortByKey at <console>:74 []
 +-(4) ShuffledRDD[12] at groupBy at <console>:74 []
 +-(4) MapPartitionsRDD[11] at groupBy at <console>:74 []
 | MapPartitionsRDD[9] at coalesce at <pastie>:75 []
 | CoalescedRDD[8] at coalesce at <pastie>:75 []
 | ShuffledRDD[7] at coalesce at <pastie>:75 []
 +-(16) MapPartitionsRDD[6] at coalesce at <pastie>:75 []
 | ParallelCollectionRDD[5] at parallelize at <pastie>:74 []
```
You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory - the directory where RDDs are checkpointed. The directory must be a HDFS path if running on a cluster. The reason is that the driver may attempt to reconstruct the checkpointed RDD from its own local file system, which is incorrect because the checkpoint files are actually on the executor machines.
You mark an RDD for checkpointing by calling RDD.checkpoint(). The RDD will be saved to a file inside the checkpoint directory and all references to its parent RDDs will be removed. This function has to be called before any job has been executed on this RDD.
NOTE: It is strongly recommended that a checkpointed RDD is persisted in memory, otherwise saving it on a file will require recomputation.
When an action is called on a checkpointed RDD, the following INFO message is printed out in the logs:
```text
Done checkpointing RDD 5 to [path], new parent is RDD [id]
```
== [[local-checkpointing]] Local Checkpointing
localCheckpoint allows you to truncate the RDD lineage graph while skipping the expensive step of replicating the materialized data to a reliable distributed file system.
This is useful for RDDs with long lineages that need to be truncated periodically, e.g. GraphX.
Local checkpointing trades fault-tolerance for performance.
NOTE: The checkpoint directory set through SparkContext.setCheckpointDir is not used.
```text
scala> rdd.checkpoint
org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext
  at org.apache.spark.rdd.RDD.checkpoint(RDD.scala:1599)
  ... 49 elided
```
```text
sc.setCheckpointDir("/tmp/rdd-checkpoint")

// Creates a subdirectory for this SparkContext
$ ls /tmp/rdd-checkpoint/
fc21e1d1-3cd9-4d51-880f-58d1dd07f783

// Mark the RDD to checkpoint at the earliest action
rdd.checkpoint

// Check out the checkpoint directory
// You should find a directory for the checkpointed RDD, e.g. rdd-2
// The number of part-000* files is exactly the number of partitions
$ ls -ltra /tmp/rdd-checkpoint/fc21e1d1-3cd9-4d51-880f-58d1dd07f783/rdd-2/part-000* | wc -l
16
```
Logical Execution Plan starts with the earliest RDDs (those with no dependencies on other RDDs or reference cached data) and ends with the RDD that produces the result of the action that has been called to execute.
Note
A logical plan (a DAG) is materialized and executed when SparkContext is requested to run a Spark job.
RDD Actions are RDD operations that produce concrete non-RDD values. They materialize a value in a Spark program. In other words, a RDD operation that returns a value of any type but RDD[T] is an action.
```text
action: RDD => a value
```
NOTE: Actions are synchronous. You can use <> to release a calling thread while calling actions.
They trigger execution of <> to return values. Simply put, an action evaluates the RDD lineage graph.
You can think of actions as a valve: until an action is fired, the data to be processed is not even in the pipes, i.e. the transformations. Only actions can materialize the entire processing pipeline with real data.
aggregate
collect
count
countApprox*
countByValue*
first
fold
foreach
foreachPartition
max
min
reduce
saveAs* (e.g. saveAsTextFile, saveAsHadoopFile)
take
takeOrdered
takeSample
toLocalIterator
top
treeAggregate
treeReduce
Actions run jobs using SparkContext.runJob or directly DAGScheduler.runJob.
```text
scala> :type words

scala> words.count // <1>
res0: Long = 502
```
TIP: You should cache the RDDs you work with when you want to execute two or more actions on them for better performance. Refer to spark-rdd-caching.md[RDD Caching and Persistence].
Before calling an action, Spark does closure/function cleaning (using SparkContext.clean) to make it ready for serialization and sending over the wire to executors. Cleaning can throw a SparkException if the computation cannot be cleaned.
NOTE: Spark uses ClosureCleaner to clean closures.
=== [[AsyncRDDActions]] AsyncRDDActions
AsyncRDDActions class offers asynchronous actions that you can use on RDDs (thanks to the implicit conversion rddToAsyncRDDActions in RDD class). The methods return a <>.
The following asynchronous methods are available:
countAsync
collectAsync
takeAsync
foreachAsync
foreachPartitionAsync
"},{"location":"rdd/spark-rdd-caching/","title":"Caching and Persistence","text":"
== RDD Caching and Persistence
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results (RDDs) are kept in memory (the default) or in more durable storage like disk, and can also be replicated.
RDDs can be cached using <> operation. They can also be persisted using <> operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
NOTE: Due to the very small and purely syntactic difference between caching and persistence of RDDs the two terms are often used interchangeably and I will follow the "pattern" here.
RDDs can also be <> to remove RDD from a permanent storage like memory and/or disk.
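A minimal spark-shell sketch of the above (variable names made up for illustration):

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000)
nums.cache()        // synonym of persist(StorageLevel.MEMORY_ONLY)
nums.count()        // the action materializes and caches the partitions
nums.unpersist()    // removes the cached blocks

val other = sc.parallelize(1 to 1000)
other.persist(StorageLevel.MEMORY_AND_DISK)  // an explicit storage level
```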
persist marks a RDD for persistence using newLevel storage:StorageLevel.md[storage level].
You can only assign a storage level once; otherwise persist reports an UnsupportedOperationException:

```text
Cannot change storage level of an RDD after it was already assigned a level
```

NOTE: Calling persist again with the same storage level as the one already assigned is allowed (and is effectively a no-op).
If the RDD is marked as persistent the first time, the RDD is core:ContextCleaner.md#registerRDDForCleanup[registered to ContextCleaner] (if available) and SparkContext.md#persistRDD[SparkContext].
The internal storageLevel attribute is set to the input newLevel storage level.
When called, unpersist prints the following INFO message to the logs:
```text
INFO [RddName]: Removing RDD [id] from persistence list
```
It then calls SparkContext.md#unpersist[SparkContext.unpersistRDD(id, blocking)] and sets storage:StorageLevel.md[NONE storage level] as the current storage level.
RDDs have two types of operations: spark-rdd-transformations.md[transformations] and spark-rdd-actions.md[actions].
NOTE: Operators are also called operations.
=== Gotchas - things to watch for
Even if you do not access it explicitly, the SparkContext cannot be referenced inside a closure, as closures are serialized and shipped to executors.
See https://issues.apache.org/jira/browse/SPARK-5063
"},{"location":"rdd/spark-rdd-partitions/","title":"Partitions and Partitioning","text":"
== Partitions and Partitioning
=== Introduction
Depending on how you look at Spark (programmer, devop, admin), an RDD is about the content (developer's and data scientist's perspective) or how it gets spread out over a cluster (performance), i.e. how many partitions an RDD represents.
A partition (aka split) is a logical chunk of a large distributed data set.
How does the number of partitions map to the number of tasks? How to verify it?
Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.
There is a close correspondence between Spark partitions and how data is laid out in data storage like HDFS or Cassandra (which is partitioned for the same reasons).
Features:
size
number
partitioning scheme
node distribution
repartitioning
"},{"location":"rdd/spark-rdd-partitions/#how-does-the-mapping-between-partitions-and-tasks-correspond-to-data-locality-if-any","title":"How does the mapping between partitions and tasks correspond to data locality if any?","text":""},{"location":"rdd/spark-rdd-partitions/#tip","title":"[TIP]","text":"
Read the following documentations to learn what experts say on the topic:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html[How Many Partitions Does An RDD Have?]
By default, a partition is created for each HDFS partition, which by default is 64MB (from http://spark.apache.org/docs/latest/programming-guide.html#external-datasets[Spark's Programming Guide]).
RDDs get partitioned automatically without programmer intervention. However, there are times when you'd like to adjust the size and number of partitions or the partitioning scheme according to the needs of your application.
You use def getPartitions: Array[Partition] method on a RDD to know the set of partitions in this RDD.
As noted in https://github.com/databricks/spark-knowledgebase/blob/master/performance_optimization/how_many_partitions_does_an_rdd_have.md#view-task-execution-against-partitions-using-the-ui[View Task Execution Against Partitions Using the UI]:
When a stage executes, you can see the number of partitions for a given stage in the Spark UI.
Start spark-shell and see it yourself!
```text
scala> sc.parallelize(1 to 100).count
res0: Long = 100
```
When you execute the Spark job, i.e. sc.parallelize(1 to 100).count, you should see the following in http://localhost:4040/jobs[Spark shell application UI].
.The number of partitions as Total tasks in UI
image::spark-partitions-ui-stages.png[align="center"]

The reason for 8 Tasks in Total is that I'm on an 8-core laptop and by default the number of partitions is the number of all available cores.
```text
$ sysctl -n hw.ncpu
8
```
You can request the minimum number of partitions using the second input parameter of many transformations.

```text
scala> sc.parallelize(1 to 100, 2).count
res1: Long = 100
```

.Total tasks in UI shows 2 partitions
image::spark-partitions-ui-stages-2-partitions.png[align="center"]
You can always ask for the number of partitions using partitions method of a RDD:
```text
scala> val ints = sc.parallelize(1 to 100, 4)
ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> ints.partitions.size
res2: Int = 4
```
In general, smaller/more numerous partitions allow work to be distributed among more workers, but larger/fewer partitions allow work to be done in larger chunks, which may result in the work getting done more quickly as long as all workers are kept busy, due to reduced overhead.
Increasing the number of partitions makes each partition hold less data (or possibly none at all!)
Spark can only run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to at least have 50 partitions (and probably http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism[2-3x times that]).
As far as choosing a \"good\" number of partitions, you generally want at least as many as the number of executors for parallelism. You can get this computed value by calling sc.defaultParallelism.
Also, the number of partitions determines how many files get generated by actions that save RDDs to files.
The maximum size of a partition is ultimately limited by the available memory of an executor.
In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partitions), the partitions parameter will be applied to all further transformations and actions on this RDD.

Partitions get redistributed among nodes whenever a shuffle occurs. Repartitioning may cause a shuffle in some situations, but it is not guaranteed in all cases, and it usually happens when an action is executed.
When creating an RDD by reading a file using rdd = SparkContext().textFile("hdfs://.../file.txt") the number of partitions may be smaller. Ideally, you would get the same number of blocks as you see in HDFS, but if the lines in your file are too long (longer than the block size), there will be fewer partitions.
The preferred way to set the number of partitions for an RDD is to pass it directly as the second input parameter of the call, e.g. rdd = sc.textFile("hdfs://.../file.txt", 400), where 400 is the number of partitions. In this case, the partitioning into 400 splits is done by Hadoop's TextInputFormat, not Spark, and it works much faster. The code also spawns 400 concurrent tasks to try to load file.txt directly into 400 partitions.
It will only work as described for uncompressed files.
When using textFile with compressed files (file.txt.gz not file.txt or similar), Spark disables splitting that makes for an RDD with only 1 partition (as reads against gzipped files cannot be parallelized). In this case, to change the number of partitions you should do <>.
Some operations, e.g. map, flatMap, filter, don't preserve partitioning.
map, flatMap, filter operations apply a function to every partition.
"},{"location":"rdd/spark-rdd-partitions/#httpssparkapacheorgdocslatesttuninghtmltuning-spark-the-official-documentation-of-spark","title":"https://spark.apache.org/docs/latest/tuning.html[Tuning Spark] (the official documentation of Spark)","text":""},{"location":"rdd/spark-rdd-partitions/#source-scala","title":"[source, scala]","text":""},{"location":"rdd/spark-rdd-partitions/#repartitionnumpartitions-intimplicit-ord-orderingt-null-rddt","title":"repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]","text":"
repartition is <> with numPartitions and shuffle enabled.
With the following computation you can see that repartition(5) causes 5 tasks to be started using NODE_LOCAL data locality.
```text
scala> lines.repartition(5).count
...
15/10/07 08:10:00 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (MapPartitionsRDD[19] at repartition at <console>:27)
15/10/07 08:10:00 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks
15/10/07 08:10:00 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 17, localhost, partition 0,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 18, localhost, partition 1,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 19, localhost, partition 2,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 20, localhost, partition 3,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 21, localhost, partition 4,NODE_LOCAL, 2089 bytes)
...
```
Compare that with repartition(1), which causes 2 tasks to be started using PROCESS_LOCAL data locality:

```text
scala> lines.repartition(1).count
...
15/10/07 08:14:09 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[20] at repartition at <console>:27)
15/10/07 08:14:09 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
15/10/07 08:14:09 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 22, localhost, partition 0,PROCESS_LOCAL, 2058 bytes)
15/10/07 08:14:09 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 23, localhost, partition 1,PROCESS_LOCAL, 2058 bytes)
...
```
Please note that Spark disables splitting for compressed files and creates RDDs with only 1 partition. In such cases, it's helpful to load the file with sc.textFile('demo.gz') and then repartition it with rdd.repartition(100), as in the sketch below.

With that, you end up with an RDD of exactly 100 partitions of roughly equal size.

rdd.repartition(N) does a shuffle to split the data to match N partitions; the partitioning is done on a round-robin basis.
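A minimal sketch of that approach (demo.gz is a placeholder file name):

```scala
// A gzipped file is not splittable, so it loads as a single-partition RDD.
val gz = sc.textFile("demo.gz")
println(gz.getNumPartitions)   // 1

// repartition(100) shuffles the data into ~100 roughly equal partitions.
val rdd = gz.repartition(100)
println(rdd.getNumPartitions)  // 100
```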
TIP: If partitioning scheme doesn't work for you, you can write your own custom partitioner.
TIP: It's useful to get familiar with https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html[Hadoop's TextInputFormat].
The coalesce transformation is used to change the number of partitions. It can trigger shuffling depending on the shuffle flag (disabled by default, i.e. false).
In the following sample, you parallelize a local 10-number sequence and coalesce it first without and then with shuffling (note the shuffle parameter being false and true, respectively).
Tip
Use toDebugString to check out the RDD lineage graph.
```text
scala> val rdd = sc.parallelize(0 to 10, 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.partitions.size
res0: Int = 8

scala> rdd.coalesce(numPartitions=8, shuffle=false) // <1>
res1: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[1] at coalesce at <console>:27

scala> res1.toDebugString
res2: String =
(8) CoalescedRDD[1] at coalesce at <console>:27 []
 | ParallelCollectionRDD[0] at parallelize at <console>:24 []

scala> rdd.coalesce(numPartitions=8, shuffle=true)
res3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at coalesce at <console>:27

scala> res3.toDebugString
res4: String =
(8) MapPartitionsRDD[5] at coalesce at <console>:27 []
 | CoalescedRDD[4] at coalesce at <console>:27 []
 | ShuffledRDD[3] at coalesce at <console>:27 []
 +-(8) MapPartitionsRDD[2] at coalesce at <console>:27 []
 | ParallelCollectionRDD[0] at parallelize at <console>:24 []
```
<1> shuffle is false by default and is used here explicitly for demo purposes. Note that the number of partitions remains the same as in the source RDD rdd.

== Transformations -- Lazy Operations on RDD (to Create One or More RDDs)
Transformations are lazy operations on an rdd:RDD.md[RDD] that create one or many new RDDs.
```scala
// T and U are Scala types
transformation: RDD[T] => RDD[U]
transformation: RDD[T] => Seq[RDD[U]]
```
In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output. Transformations do not change the input RDD (since rdd:index.md#introduction[RDDs are immutable] and hence cannot be modified), but produce one or more new RDDs by applying the computations they represent.
By applying transformations you incrementally build a RDD lineage with all the parent RDDs of the final RDD(s).
Transformations are lazy, i.e. are not executed immediately. Only after calling an action are transformations executed.
After executing a transformation, the result RDD(s) will always be different from their parents and can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).
CAUTION: There are transformations that may trigger jobs, e.g. sortBy, <>, etc.
.From SparkContext by transformations to the result image::rdd-sparkcontext-transformations-action.png[align=\"center\"]
Certain transformations can be pipelined which is an optimization that Spark uses to improve performance of computations.
Narrow transformations are the result of operations such as map and filter, where the data comes from a single partition only, i.e. the computation is self-contained. An output RDD has partitions with records that originate from a single partition in the parent RDD, and only a limited subset of partitions is used to calculate the result.
Spark groups narrow transformations as a stage which is called pipelining.
=== [[wide-transformations]] Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey. The data required to compute the records in a single partition may reside in many partitions of the parent RDD.
NOTE: Wide transformations are also called shuffle transformations as they may (and usually do) require a shuffle.
All of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy these operations, Spark must execute a RDD shuffle, which transfers data across cluster and results in a new stage with a new set of partitions.
= Status REST API -- Monitoring Spark Applications Using REST API
Status REST API is a collection of REST endpoints under /api/v1 URI path in the spark-api-UIRoot.md[root containers for application UI information]:
[[SparkUI]] spark-webui-SparkUI.md[SparkUI] - Application UI for an active Spark application (i.e. a Spark application that is still running)
[[HistoryServer]] spark-history-server:HistoryServer.md[HistoryServer] - Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)
Status REST API uses spark-api-ApiRootResource.md[ApiRootResource] main resource class that registers /api/v1 URI <>.
|===
| [[applications]] applications
| [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

| [[applications_appId]] applications/\{appId}
| [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

| [[version]] version
| Creates a VersionInfo with the current version of Spark
|===
Status REST API uses the following components:
https://jersey.github.io/[Jersey RESTful Web Services framework] with support for the https://github.com/jax-rs[Java API for RESTful Web Services] (JAX-RS API)
https://www.eclipse.org/jetty/[Eclipse Jetty] as the lightweight HTTP server and the https://jcp.org/en/jsr/detail?id=369[Java Servlet] container
== [[ApiRootResource]] ApiRootResource -- /api/v1 URI Handler
ApiRootResource is the spark-api-ApiRequestContext.md[ApiRequestContext] for the /v1 URI path.
ApiRootResource uses @Path("/v1") annotation at the class level. It is a partial URI path template relative to the base URI of the server on which the resource is deployed, the context root of the application, and the URL pattern to which the JAX-RS runtime responds.
TIP: Learn more about @Path annotation in https://docs.oracle.com/cd/E19798-01/821-1841/6nmq2cp26/index.html[The @Path Annotation and URI Path Templates].
ApiRootResource <> the /api/* context handler (with the REST resources and providers in org.apache.spark.status.api.v1 package).
With the @Path("/v1") annotation and after <> the /api/* context handler, ApiRootResource serves HTTP requests for <> under the /api/v1 URI paths for spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer].
ApiRootResource gives the metrics of a Spark application in JSON format (using JAX-RS API).
|===
| [[applications]] applications
| [[ApplicationListResource]] Delegates to the spark-api-ApplicationListResource.md[ApplicationListResource] resource class

| [[applications_appId]] applications/\{appId}
| [[OneApplicationResource]] Delegates to the spark-api-OneApplicationResource.md[OneApplicationResource] resource class

| [[version]] version
| GET. Creates a VersionInfo with the current version of Spark
|===
getServletHandler creates a Jetty ServletContextHandler for /api context path.
NOTE: The Jetty ServletContextHandler created does not support HTTP sessions as REST API is stateless.
getServletHandler creates a Jetty ServletHolder with the resources and providers in org.apache.spark.status.api.v1 package. It then registers the ServletHolder to serve /* context path (under the ServletContextHandler for /api).
getServletHandler requests UIRootFromServletContext to spark-api-UIRootFromServletContext.md#setUiRoot[setUiRoot] with the ServletContextHandler and the input spark-api-UIRoot.md[UIRoot].
NOTE: getServletHandler is used when spark-webui-SparkUI.md#initialize[SparkUI] and spark-history-server:HistoryServer.md#initialize[HistoryServer] are requested to initialize.
== [[ApplicationListResource]] ApplicationListResource -- applications URI Handler
ApplicationListResource is a spark-api-ApiRequestContext.md[ApiRequestContext] that spark-api-ApiRootResource.md#applications[ApiRootResource] uses to handle <> URI path.
OneApplicationAttemptResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly).
OneApplicationAttemptResource is used when AbstractApplicationResource is requested to spark-api-AbstractApplicationResource.md#applicationAttempt[applicationAttempt].
```scala
// start spark-shell
// there should be a single Spark application -- the spark-shell itself
// CAUTION: FIXME Demo of OneApplicationAttemptResource in Action
```
getAttempt requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]) and finds the spark-api-BaseAppResource.md#attemptId[attemptId] among the available attempts.
NOTE: spark-api-BaseAppResource.md#appId[appId] and spark-api-BaseAppResource.md#attemptId[attemptId] are path parameters.
In the end, getAttempt returns the ApplicationAttemptInfo if available or reports a NotFoundException:
== [[OneApplicationResource]] OneApplicationResource -- applications/appId URI Handler
OneApplicationResource is a spark-api-AbstractApplicationResource.md[AbstractApplicationResource] (and so a spark-api-ApiRequestContext.md[ApiRequestContext] indirectly) that spark-api-ApiRootResource.md#applications_appId[ApiRootResource] uses to handle <> URI path.
getApp requests the spark-api-ApiRequestContext.md#uiRoot[UIRoot] for the spark-api-UIRoot.md#getApplicationInfo[application info] (given the spark-api-BaseAppResource.md#appId[appId]).
In the end, getApp returns the ApplicationInfo if available or reports a NotFoundException:
|===
| spark-history-server:HistoryServer.md[HistoryServer]
| [[HistoryServer]] Application UI for active and completed Spark applications (i.e. Spark applications that are still running or have already finished)

| spark-webui-SparkUI.md[SparkUI]
| [[SparkUI]] Application UI for an active Spark application (i.e. a Spark application that is still running)
|===
UIRootFromServletContext manages the current <> object in a Jetty ContextHandler.
[[attribute]] UIRootFromServletContext uses its canonical name for the context attribute that is used to <> or <> the current spark-api-UIRoot.md[UIRoot] object (in Jetty's ContextHandler).
NOTE: https://www.eclipse.org/jetty/javadoc/current/org/eclipse/jetty/server/handler/ContextHandler.html[ContextHandler] is the environment for multiple Jetty Handlers, e.g. URI context path, class loader, static resource base.
In essence, UIRootFromServletContext is simply a \"bridge\" between two worlds, Spark's spark-api-UIRoot.md[UIRoot] and Jetty's ContextHandler.
NOTE: setUiRoot is used exclusively when ApiRootResource is requested to spark-api-ApiRootResource.md#getServletHandler[register /api/* context handler].
NettyRpcEnv is an RpcEnv that uses Netty ("an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients").
create creates a JavaSerializerInstance (using a JavaSerializer).
Note
KryoSerializer is not supported.
create creates a rpc:NettyRpcEnv.md[] with the JavaSerializerInstance. create uses the given rpc:RpcEnvConfig.md[] for the rpc:RpcEnvConfig.md#advertiseAddress[advertised address], rpc:RpcEnvConfig.md#securityManager[SecurityManager] and rpc:RpcEnvConfig.md#numUsableCores[number of CPU cores].
create returns the NettyRpcEnv unless the rpc:RpcEnvConfig.md#clientMode[clientMode] is turned off (server mode).
In server mode, create attempts to start the NettyRpcEnv on a given port. create uses the given rpc:RpcEnvConfig.md[] for the rpc:RpcEnvConfig.md#port[port], rpc:RpcEnvConfig.md#bindAddress[bind address], and rpc:RpcEnvConfig.md#name[name]. With the port, the NettyRpcEnv is requested to rpc:NettyRpcEnv.md#startServer[start a server].
create is part of the rpc:RpcEnvFactory.md#create[RpcEnvFactory] abstraction.
RpcEndpointRef is a reference to a rpc:RpcEndpoint.md[RpcEndpoint] in a rpc:index.md[RpcEnv].
RpcEndpointRef is a serializable entity and so you can send it over a network or save it for later use (it can however be deserialized using the owning RpcEnv only).
A RpcEndpointRef has <> (a Spark URL), and a name.
You can send asynchronous one-way messages to the corresponding RpcEndpoint using <> method.
You can send a semi-synchronous message, i.e. "subscribe" to be notified when a response arrives, using the ask method. You can also block the current calling thread for a response using the askWithRetry method.
spark.rpc.numRetries (default: 3) - the number of times to retry connection attempts.
spark.rpc.retry.wait (default: 3s) - the number of milliseconds to wait on each retry.
It also uses rpc:index.md#endpoint-lookup-timeout[lookup timeouts].
RpcEnv uses the default lookup timeout for...FIXME
When a remote endpoint is resolved, a local RPC environment connects to the remote one (endpoint lookup). To configure the time needed for the endpoint lookup you can use the following settings.
It is a prioritized list of lookup timeout properties (the higher on the list, the more important):
[[creating-instance]] RpcEnvConfig is a configuration of an rpc:RpcEnv.md[]:
[[conf]] SparkConf.md[]
[[name]] System Name
[[bindAddress]] Bind Address
[[advertiseAddress]] Advertised Address
[[port]] Port
[[securityManager]] SecurityManager
[[numUsableCores]] Number of CPU cores
<>
RpcEnvConfig is created when RpcEnv utility is used to rpc:RpcEnv.md#create[create an RpcEnv] (using rpc:RpcEnvFactory.md[]).
== [[clientMode]] Client Mode
When an RPC Environment is initialized core:SparkEnv.md#createDriverEnv[as part of the initialization of the driver] or core:SparkEnv.md#createExecutorEnv[executors] (using RpcEnv.create), clientMode is false for the driver and true for executors.
Copied (almost verbatim) from https://issues.apache.org/jira/browse/SPARK-10997[SPARK-10997 Netty-based RPC env should support a "client-only" mode] and the https://github.com/apache/spark/commit/71d1c907dec446db566b19f912159fd8f46deb7d[commit]:

"Client mode" means the RPC env will not listen for incoming connections.

This allows certain processes in the Spark stack (such as Executors or the YARN client-mode AM) to act as pure clients when using the netty-based RPC backend, reducing the number of sockets Spark apps need to use and also the number of open ports.
The AM connects to the driver in \"client mode\", and that connection is used for all driver -- AM communication, and so the AM is properly notified when the connection goes down.
In \"general\", non-YARN case, clientMode flag is therefore enabled for executors and disabled for the driver.
In Spark on YARN in client deploy mode, clientMode flag is however enabled explicitly when Spark on YARN's spark-yarn-applicationmaster.md#runExecutorLauncher-sparkYarnAM[ApplicationMaster] creates the sparkYarnAM RPC Environment.
rpc:NettyRpcEnvFactory.md[] is the default and only known RpcEnvFactory in Apache Spark (as of https://github.com/apache/spark/commit/4f5a24d7e73104771f233af041eeba4f41675974[this commit]).
Netty-based RPC Environment is created by NettyRpcEnvFactory when rpc:index.md#settings[spark.rpc] is netty or org.apache.spark.rpc.netty.NettyRpcEnvFactory.
NettyRpcEnv is only started on spark-driver.md[the driver]. See <>.
The default port to listen to is 7077.
When NettyRpcEnv starts, the following INFO message is printed out in the logs:
```text
Successfully started service 'NettyRpcEnv' on port 0.
```
== [[thread-pools]] Thread Pools
=== shuffle-server-ID
EventLoopGroup uses a daemon thread pool called shuffle-server-ID, where ID is a unique integer for NioEventLoopGroup (NIO) or EpollEventLoopGroup (EPOLL) for the Shuffle server.
CAUTION: FIXME Review Netty's NioEventLoopGroup.
CAUTION: FIXME Where are SO_BACKLOG, SO_RCVBUF, SO_SNDBUF channel options used?
=== dispatcher-event-loop-ID
NettyRpcEnv's Dispatcher uses the daemon fixed thread pool with <> threads.
Thread names are formatted as dispatcher-event-loop-ID, where ID is a unique, sequentially assigned integer.
It starts the message processing loop on all of the threads.
=== netty-rpc-env-timeout
NettyRpcEnv uses the daemon single-thread scheduled thread pool netty-rpc-env-timeout.
NettyRpcEnv uses the daemon cached thread pool with up to <> threads.
Thread names are formatted as netty-rpc-connection-ID, where ID is a unique, sequentially assigned integer.
== [[settings]] Settings
The Netty-based implementation uses the following properties:
spark.rpc.io.mode (default: NIO) - NIO or EPOLL for low-level IO. NIO is always available, while EPOLL is only available on Linux. NIO uses io.netty.channel.nio.NioEventLoopGroup while EPOLL uses io.netty.channel.epoll.EpollEventLoopGroup.
spark.rpc.io.threads (default: 0; maximum: 8) - the number of threads to use for the Netty client and server thread pools
** spark.shuffle.io.serverThreads (default: the value of spark.rpc.io.threads)
** spark.shuffle.io.clientThreads (default: the value of spark.rpc.io.threads)
spark.rpc.netty.dispatcher.numThreads (default: the number of processors available to JVM)
spark.rpc.connect.threads (default: 64) - used in cluster mode to communicate with a remote RPC endpoint
spark.port.maxRetries (default: 16 or 100 for testing when spark.testing is set) controls the maximum number of binding attempts/retries to a port before giving up.
== [[endpoints]] Endpoints
endpoint-verifier (RpcEndpointVerifier) - a rpc:RpcEndpoint.md[RpcEndpoint] for remote RpcEnvs to query whether an RpcEndpoint exists or not. It uses Dispatcher that keeps track of registered endpoints and responds true/false to CheckExistence message.
endpoint-verifier is used to check out whether a given endpoint exists or not before the endpoint's reference is given back to clients.
One use case is when an spark-standalone.md#AppClient[AppClient connects to standalone Masters] before it registers the application it acts for.
CAUTION: FIXME Who'd like to use endpoint-verifier and how?
== Message Dispatcher
A message dispatcher is responsible for routing RPC messages to the appropriate endpoint(s).
It uses the daemon fixed thread pool dispatcher-event-loop with spark.rpc.netty.dispatcher.numThreads threads for dispatching messages.
Every partition of a Stage is transformed into a Task (ShuffleMapTask or ResultTask for ShuffleMapStage and ResultStage, respectively).
Submitting a stage can therefore trigger execution of a series of dependent parent stages.
When a Spark job is submitted, a new stage is created (they can be created from scratch or linked to, i.e. shared, if other jobs use them already).
DAGScheduler splits up a job into a collection of Stages. A Stage contains a sequence of narrow transformations that can be completed without shuffling the data set, separated at shuffle boundaries (where a shuffle occurs). Stages are thus a result of breaking the RDD graph at shuffle boundaries.
Shuffle boundaries introduce a barrier where stages/tasks must wait for the previous stage to finish before they fetch map outputs.
ActiveJob (job, action job) is a top-level work item (computation) submitted to DAGScheduler for execution (usually to compute the result of an RDD action).
Executing a job is equivalent to computing the partitions of the RDD an action has been executed upon. The number of partitions (numPartitions) to compute in a job depends on the type of a stage (ResultStage or ShuffleMapStage).
A job starts with a single target RDD, but can ultimately include other RDDs that are all part of RDD lineage.
The parent stages are always ShuffleMapStages.
Note
Not always all partitions have to be computed for ResultStages (e.g. for actions like first() and lookup()).
CoarseGrainedSchedulerBackend is a base SchedulerBackend for coarse-grained schedulers.
CoarseGrainedSchedulerBackend is an ExecutorAllocationClient.
CoarseGrainedSchedulerBackend is responsible for requesting resources from a cluster manager for executors that it in turn uses to launch tasks (on CoarseGrainedExecutorBackend).
CoarseGrainedSchedulerBackend holds executors for the duration of the Spark job rather than relinquishing executors whenever a task is done and asking the scheduler to launch a new executor for each new task.
CoarseGrainedSchedulerBackend registers CoarseGrainedScheduler RPC Endpoint that executors use for RPC communication.
Note
Active executors are executors that are not pending to be removed or lost.
maxNumConcurrentTasks is part of the SchedulerBackend abstraction.
maxNumConcurrentTasks uses the Available Executors registry to find out about available ResourceProfiles, total number of CPU cores and ExecutorResourceInfos of every active executor.
In the end, maxNumConcurrentTasks calculates the available (parallel) slots for the given ResourceProfile (and given the available executor resources).
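As an illustration only (assuming a single default ResourceProfile where CPU cores are the limiting resource; the numbers are made up):

```scala
// Back-of-the-envelope estimate of parallel task slots.
val activeExecutors = 10
val coresPerExecutor = 16
val taskCpus = 2  // spark.task.cpus

val slots = activeExecutors * (coresPerExecutor / taskCpus)
// slots = 80, i.e. at most 80 tasks can run concurrently with these resources
```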
sufficientResourcesRegistered is true by default (and is supposed to be overridden by custom CoarseGrainedSchedulerBackends).
","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#minimum-resources-available-ratio","title":"Minimum Resources Available Ratio
minRegisteredRatio: Double
minRegisteredRatio is a ratio of the minimum resources available to the total expected resources for the CoarseGrainedSchedulerBackend to be ready for scheduling tasks (for execution).
minRegisteredRatio uses spark.scheduler.minRegisteredResourcesRatio configuration property if defined or defaults to 0.0.
minRegisteredRatio can be between 0.0 and 1.0 (inclusive).
minRegisteredRatio is used when:
CoarseGrainedSchedulerBackend is requested to isReady
StandaloneSchedulerBackend is requested to sufficientResourcesRegistered
KubernetesClusterSchedulerBackend is requested to sufficientResourcesRegistered
MesosCoarseGrainedSchedulerBackend is requested to sufficientResourcesRegistered
YarnSchedulerBackend is requested to sufficientResourcesRegistered
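A minimal configuration sketch (the values are made up): require 80% of the expected resources before scheduling starts, bounded by a maximum waiting time.

[source,scala]
----
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  // upper bound on how long to wait for the ratio to be reached
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")
----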
makeOffers takes the active executors (out of the <> internal registry) and creates WorkerOffer resource offers for each (one per executor with the executor's id, host and free cores).
CAUTION: Only free cores are considered in making offers. Memory is not! Why?!
It then requests TaskSchedulerImpl.md#resourceOffers[TaskSchedulerImpl to process the resource offers] to create a collection of TaskDescription collections that it in turn uses to launch tasks.
requestExecutors is a "decorator" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).
requestExecutors method is part of the ExecutorAllocationClient abstraction.
When called, you should see the following INFO message followed by a DEBUG message in the logs:
Requesting [numAdditionalExecutors] additional executor(s) from the cluster manager
Number of pending executors is now [numPendingExecutors]
<> is increased by the input numAdditionalExecutors.
requestExecutors requests executors from a cluster manager (that reflects the current computation needs). The "new executor total" is a sum of the internal <> and <> decreased by the <>.
If numAdditionalExecutors is negative, an IllegalArgumentException is thrown:
Attempted to request a negative number of additional executor(s) [numAdditionalExecutors] from the cluster manager. Please specify a positive number!
NOTE: It is a final method that no other scheduler backends could customize further.
NOTE: The method is a synchronized block that makes multiple concurrent requests be handled in a serial fashion, i.e. one by one.
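A hedged usage sketch: SparkContext exposes a developer API that delegates to the ExecutorAllocationClient (and so, on coarse-grained backends, to requestExecutors). It is only meaningful on cluster managers that support requesting executors at runtime.

[source,scala]
----
// assuming sc is an existing SparkContext running against a supported cluster manager
val acknowledged: Boolean = sc.requestExecutors(numAdditionalExecutors = 2)
----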
","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-exact-number-of-executors","title":"Requesting Exact Number of Executors
requestTotalExecutors is a "decorator" method that ultimately calls a cluster-specific doRequestTotalExecutors method and returns whether the request was acknowledged or not (it is assumed false by default).
requestTotalExecutors is part of the ExecutorAllocationClient abstraction.
It sets the internal <> and <> registries. It then calculates the exact number of executors which is the input numExecutors and the <> decreased by the number of <>.
If numExecutors is negative, an IllegalArgumentException is thrown:
Attempted to request a negative number of executor(s) [numExecutors] from the cluster manager. Please specify a positive number!
NOTE: It is a final method that no other scheduler backends could customize further.
NOTE: The method is a synchronized block that makes multiple concurrent requests be handled in a serial fashion, i.e. one by one.
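A hedged sketch of the corresponding developer API on SparkContext (the exact signature may differ between Spark versions; the locality hints are left empty here):

[source,scala]
----
// assuming sc is an existing SparkContext; ask for an exact total of 4 executors
val acknowledged: Boolean = sc.requestTotalExecutors(4, 0, Map.empty[String, Int])
----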
","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#finding-default-level-of-parallelism","title":"Finding Default Level of Parallelism
defaultParallelism(): Int
defaultParallelism is part of the SchedulerBackend abstraction.
defaultParallelism is spark.default.parallelism configuration property if defined.
Otherwise, defaultParallelism is the maximum of totalCoreCount and 2.
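A small sketch (a local master is used here, where the LocalSchedulerBackend also honours the property): with spark.default.parallelism set explicitly, defaultParallelism returns that value.

[source,scala]
----
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("default-parallelism-demo")  // hypothetical application name
  .set("spark.default.parallelism", "8")

val sc = new SparkContext(conf)
assert(sc.defaultParallelism == 8)  // falls back to max(totalCoreCount, 2) when unset
----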
killTask is part of the SchedulerBackend abstraction.
killTask simply sends a KillTask message to <>.
== Stopping All Executors
stopExecutors sends a blocking <> message to <> (if already initialized).
NOTE: It is called exclusively while CoarseGrainedSchedulerBackend is <>.
You should see the following INFO message in the logs:
Shutting down all executors
== Reset State
reset resets the internal state:
Sets <> to 0
Clears executorsPendingToRemove
Sends a blocking <> message to <> for every executor (in the internal executorDataMap) to inform it about SlaveLost with the message: +
Stale executor after cluster manager re-registered.
reset is a method that is defined in CoarseGrainedSchedulerBackend, but used and overridden exclusively by yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].
NOTE: It is called by subclasses spark-standalone.md#SparkDeploySchedulerBackend[SparkDeploySchedulerBackend], spark-mesos/spark-mesos.md#CoarseMesosSchedulerBackend[CoarseMesosSchedulerBackend], and yarn/spark-yarn-yarnschedulerbackend.md[YarnSchedulerBackend].
When <>, it registers CoarseGrainedScheduler RPC endpoint to be the driver's communication endpoint.
driverEndpoint is a DriverEndpoint.
Note
CoarseGrainedSchedulerBackend is created while SparkContext is being created that in turn lives inside a Spark driver. That explains the name driverEndpoint (at least partially).
It is called standalone scheduler's driver endpoint internally.
It tracks:
It uses driver-revive-thread daemon single-thread thread pool for ...FIXME
CAUTION: FIXME A potential issue with driverEndpoint.asInstanceOf[NettyRpcEndpointRef].toURI - doubles spark:// prefix.
start is part of the SchedulerBackend abstraction.
start takes all spark.-prefixed properties and registers the <CoarseGrainedScheduler RPC endpoint>> (backed by DriverEndpoint ThreadSafeRpcEndpoint).
NOTE: start uses <> to access the current SparkContext.md[SparkContext] and in turn SparkConf.md[SparkConf].
NOTE: start uses <> that was given when <CoarseGrainedSchedulerBackend was created>>.
== Checking If Sufficient Compute Resources Available Or Waiting Time Passed
isReady(): Boolean
isReady is part of the SchedulerBackend abstraction.
isReady allows delaying task launching until <> or <> passes.
Internally, isReady <>.
NOTE: <> by default responds that sufficient resources are available.
If the <>, you should see the following INFO message in the logs and isReady is positive.
SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: [minRegisteredRatio]
If there are no sufficient resources available yet (the above requirement does not hold), isReady checks whether the time since <> passed <> to give a way to launch tasks (even when <> not being reached yet).
You should see the following INFO message in the logs and isReady is positive.
SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: [maxRegisteredWaitingTimeMs](ms)
Otherwise, when <> and <> has not elapsed, isReady is negative.
== Reviving Resource Offers
reviveOffers(): Unit
reviveOffers is part of the SchedulerBackend abstraction.
reviveOffers simply sends a ReviveOffers message to CoarseGrainedSchedulerBackend RPC endpoint.
createDriverEndpointRef <> and rpc:index.md#setupEndpoint[registers it] as CoarseGrainedScheduler.
createDriverEndpointRef is used when CoarseGrainedSchedulerBackend is requested to <>.
== Checking Whether Executor is Active
isExecutorActive(
  id: String): Boolean
isExecutorActive is part of the ExecutorAllocationClient abstraction.
isExecutorActive...FIXME
","text":""},{"location":"scheduler/CoarseGrainedSchedulerBackend/#requesting-executors-from-cluster-manager","title":"Requesting Executors from Cluster Manager
The introduction that follows was highly influenced by the scaladoc of org.apache.spark.scheduler.DAGScheduler. As DAGScheduler is a private class it does not appear in the official API documentation. You are strongly encouraged to read the sources and only then read this and the related pages afterwards.
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling using Jobs and Stages.
DAGScheduler transforms a logical execution plan (RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).
After an action has been called on an RDD, SparkContext hands over a logical plan to DAGScheduler that it in turn translates to a set of stages that are submitted as TaskSets for execution.
DAGScheduler works solely on the driver and is created as part of SparkContext's initialization (right after TaskScheduler and SchedulerBackend are ready).
DAGScheduler does three things in Spark:
Computes an execution DAG (DAG of stages) for a job
Determines the preferred locations to run each task on
Handles failures due to shuffle output files being lost
DAGScheduler computes a directed acyclic graph (DAG) of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run jobs. It then submits stages to TaskScheduler.
In addition to coming up with the execution DAG, DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes the information to TaskScheduler.
DAGScheduler tracks which rdd/spark-rdd-caching.md[RDDs are cached (or persisted)] to avoid "recomputing" them, i.e. redoing the map side of a shuffle. DAGScheduler remembers what ShuffleMapStage.md[ShuffleMapStage]s have already produced output files (that are stored in BlockManagers).
DAGScheduler is only interested in cache location coordinates, i.e. host and executor id, per partition of a RDD.
Furthermore, it handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler itself, which will retry each task a small number of times before cancelling the whole stage.
DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially. See the section Event Bus.
DAGScheduler runs stages in topological order.
DAGScheduler uses SparkContext, TaskScheduler, LiveListenerBus.md[], MapOutputTracker.md[MapOutputTracker] and storage:BlockManager.md[BlockManager] for its services. However, at the very minimum, DAGScheduler takes a SparkContext only (and requests SparkContext for the other services).
When DAGScheduler schedules a job as a result of rdd/index.md#actions[executing an action on a RDD] or calling SparkContext.runJob() method directly, it spawns parallel tasks to compute (partial) results per partition.
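A hedged example (assuming a SparkContext sc): runJob spawns one task per partition, here computing a per-partition sum, so the result has one entry per partition.

[source,scala]
----
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// one task per partition; DAGScheduler builds the (single-stage) DAG and submits a TaskSet
val perPartitionSums: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.sum)
----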
submitMapStage requests the given ShuffleDependency for the RDD.
submitMapStage gets the job ID and increments it (for future submissions).
submitMapStage creates a JobWaiter to wait for a MapOutputStatistics. The JobWaiter waits for 1 task and, when completed successfully, executes the given callback function with the computed MapOutputStatistics.
In the end, submitMapStage posts a MapStageSubmitted and returns the JobWaiter.
Used when:
SparkContext is requested to submit a MapStage for execution
submitJob increments the nextJobId internal counter.
submitJob creates a JobWaiter for the (number of) partitions and the given resultHandler function.
submitJob requests the DAGSchedulerEventProcessLoop to post a JobSubmitted.
In the end, submitJob returns the JobWaiter.
For empty partitions (no partitions to compute), submitJob requests the LiveListenerBus to post a SparkListenerJobStart and SparkListenerJobEnd (with JobSucceeded result marker) events and returns a JobWaiter with no tasks to wait for.
submitJob throws an IllegalArgumentException when the partition indices are not among the partitions of the given RDD:
Attempting to access a non-existent partition: [p]. Total number of partitions: [maxPartitions]
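A hedged sketch of the public entry point that ends up in DAGScheduler.submitJob (assuming a SparkContext sc): only the first two partitions are computed and their per-partition sums are collected through the resultHandler callback.

[source,scala]
----
import scala.collection.mutable
import scala.concurrent.Await
import scala.concurrent.duration.Duration

val rdd = sc.parallelize(1 to 100, 4)
val results = mutable.Map.empty[Int, Int]

val futureAction = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.sum,                   // processPartition
  Seq(0, 1),                                       // partitions to compute
  (index: Int, sum: Int) => results(index) = sum,  // resultHandler
  results.toMap)                                   // resultFunc (overall result)

Await.ready(futureAction, Duration.Inf)
----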
Adds a new ActiveJob when requested to handle JobSubmitted or MapStageSubmitted events
Removes an ActiveJob when requested to clean up after an ActiveJob and independent stages.
Removes all ActiveJobs when requested to doCancelAllJobs.
DAGScheduler uses ActiveJobs registry when requested to handle JobGroupCancelled or TaskCompletion events, to cleanUpAfterSchedulerStop and to abort a stage.
The number of ActiveJobs is available using job.activeJobs performance metric.
","text":""},{"location":"scheduler/DAGScheduler/#createResultStage","title":"Creating ResultStage for RDD
createResultStage creates a new ResultStage for the ShuffleDependencies and ResourceProfiles of the given RDD.
createResultStage finds the ShuffleDependencies and ResourceProfiles for the given RDD.
createResultStage merges the ResourceProfiles for the Stage (if merging is enabled), or reports an exception otherwise.
createResultStage does the following checks (that may report violations and break the execution):
checkBarrierStageWithDynamicAllocation
checkBarrierStageWithNumSlots
checkBarrierStageWithRDDChainPattern
createResultStage getOrCreateParentStages (with the ShuffleDependencies and the given jobId).
createResultStage uses the nextStageId counter for a stage ID.
createResultStage creates a new ResultStage (with the unique id of a ResourceProfile among others).
createResultStage registers the ResultStage with the stage ID in stageIdToStage.
createResultStage updateJobIdStageIdMaps and returns the ResultStage.
createResultStage is used when:
DAGScheduler is requested to handle a JobSubmitted event
","text":""},{"location":"scheduler/DAGScheduler/#creating-shufflemapstage-for-shuffledependency","title":"Creating ShuffleMapStage for ShuffleDependency
cleanupStateForJobAndIndependentStages cleans up the state for job and any stages that are not part of any other job.
cleanupStateForJobAndIndependentStages looks the job up in the internal jobIdToStageIds registry.
If no stages are found, the following ERROR is printed out to the logs:
No stages registered for job [jobId]
Otherwise, cleanupStateForJobAndIndependentStages uses the stageIdToStage registry to find the stages (the real objects, not ids!).
For each stage, cleanupStateForJobAndIndependentStages reads the jobs the stage belongs to.
If the job does not belong to the jobs of the stage, the following ERROR is printed out to the logs:
Job [jobId] not registered for stage [stageId] even though that stage was registered for the job
If the job was the only job for the stage, the stage (and the stage id) gets cleaned up from the registries, i.e. runningStages, shuffleIdToMapStage, waitingStages, failedStages and stageIdToStage.
While removing from runningStages, you should see the following DEBUG message in the logs:
Removing running stage [stageId]
While removing from waitingStages, you should see the following DEBUG message in the logs:
Removing stage [stageId] from waiting set.
While removing from failedStages, you should see the following DEBUG message in the logs:
Removing stage [stageId] from failed set.
After all cleaning (using stageIdToStage as the source registry), if the stage belonged to the one and only job, you should see the following DEBUG message in the logs:
After removal of stage [stageId], remaining stages = [stageIdToStage.size]
The job is removed from jobIdToStageIds, jobIdToActiveJob, activeJobs registries.
The final stage of the job is removed, i.e. ResultStage or ShuffleMapStage.
cleanupStateForJobAndIndependentStages is used in handleTaskCompletion when a ResultTask has completed successfully, failJobAndIndependentStages and markMapStageJobAsFinished.
markMapStageJobAsFinished marks the given ActiveJob finished and posts a SparkListenerJobEnd.
markMapStageJobAsFinished requests the given ActiveJob to turn on (true) the 0th bit in the finished partitions registry and increase the number of tasks finished.
markMapStageJobAsFinished requests the given ActiveJob for the JobListener that is requested to taskSucceeded (with the 0th index and the given MapOutputStatistics).
In the end, markMapStageJobAsFinished requests the LiveListenerBus to post a SparkListenerJobEnd.
markMapStageJobAsFinished is used when:
DAGScheduler is requested to handleMapStageSubmitted and markMapStageJobsAsFinished
","text":""},{"location":"scheduler/DAGScheduler/#finding-or-creating-missing-direct-parent-shufflemapstages-for-shuffledependencies-of-rdd","title":"Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies) of RDD
","text":""},{"location":"scheduler/DAGScheduler/#looking-up-shufflemapstage-for-shuffledependency","title":"Looking Up ShuffleMapStage for ShuffleDependency
getOrCreateShuffleMapStage finds a ShuffleMapStage by the shuffleId of the given ShuffleDependency in the shuffleIdToMapStage internal registry and returns it if available.
If not found, getOrCreateShuffleMapStage finds all the missing ancestor shuffle dependencies and creates the missing ShuffleMapStage stages (including one for the input ShuffleDependency).
getOrCreateShuffleMapStage is used when:
DAGScheduler is requested to find or create missing direct parent ShuffleMapStages of an RDD, find missing parent ShuffleMapStages for a stage, handle a MapStageSubmitted event, and check out stage dependency on a stage
","text":""},{"location":"scheduler/DAGScheduler/#missing-shuffledependencies-of-rdd","title":"Missing ShuffleDependencies of RDD
getMissingAncestorShuffleDependencies finds all the missing ShuffleDependencies for the given RDD (traversing its RDD lineage).
Note
A ShuffleDependency (of an RDD) is considered missing when not registered in the shuffleIdToMapStage internal registry.
Internally, getMissingAncestorShuffleDependencies finds direct parent shuffle dependencies of the input RDD and collects the ones that are not registered in the shuffleIdToMapStage internal registry. It repeats the process for the RDDs of the parent shuffle dependencies.
","text":""},{"location":"scheduler/DAGScheduler/#finding-direct-parent-shuffle-dependencies-of-rdd","title":"Finding Direct Parent Shuffle Dependencies of RDD
getShuffleDependencies finds direct parent shuffle dependencies for the given RDD.
Internally, getShuffleDependencies takes the direct rdd/index.md#dependencies[shuffle dependencies of the input RDD] and direct shuffle dependencies of all the parent non-ShuffleDependencies in the RDD lineage.
getShuffleDependencies is used when DAGScheduler is requested to find or create missing direct parent ShuffleMapStages (for ShuffleDependencies of a RDD) and find all missing shuffle dependencies for a given RDD.
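A hedged example of the distinction (assuming a SparkContext sc): only the last shuffle (groupByKey) is a direct parent ShuffleDependency of grouped; the earlier one (reduceByKey) is an ancestor shuffle dependency reachable only through the RDD lineage.

[source,scala]
----
val words   = sc.parallelize(Seq("a", "b", "a", "c"))
val counts  = words.map((_, 1)).reduceByKey(_ + _)               // shuffle #1 (ancestor)
val grouped = counts.map { case (w, n) => (n, w) }.groupByKey()  // shuffle #2 (direct parent)
----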
","text":""},{"location":"scheduler/DAGScheduler/#failing-job-and-independent-single-job-stages","title":"Failing Job and Independent Single-Job Stages
failJobAndIndependentStages fails the input job and all the stages that are only used by the job.
Internally, failJobAndIndependentStages uses jobIdToStageIds internal registry to look up the stages registered for the job.
If no stages could be found, you should see the following ERROR message in the logs:
No stages registered for job [id]
Otherwise, for every stage, failJobAndIndependentStages finds the job ids the stage belongs to.
If no stages could be found or the job is not referenced by the stages, you should see the following ERROR message in the logs:
Job [id] not registered for stage [id] even though that stage was registered for the job
Only when there is exactly one job registered for the stage and the stage is in RUNNING state (in runningStages internal registry), TaskScheduler.md#contract[TaskScheduler is requested to cancel the stage's tasks] and <>.
NOTE: failJobAndIndependentStages uses jobIdToStageIds, stageIdToStage, and runningStages internal registries.
abortStage is an internal method that finds all the active jobs that depend on the failedStage stage and fails them.
Internally, abortStage looks the failedStage stage up in the internal stageIdToStage registry and exits if the stage was not registered earlier.
If it was, abortStage finds all the active jobs (in the internal activeJobs registry) with the <failedStage stage>>.
At this time, the completionTime property (of the failed stage's StageInfo) is assigned to the current time (millis).
All the active jobs that depend on the failed stage (as calculated above) and the stages that do not belong to other jobs (aka independent stages) are <> (with the failure reason being "Job aborted due to stage failure: [reason]" and the input exception).
If there are no jobs depending on the failed stage, you should see the following INFO message in the logs:
Ignoring failure of [failedStage] because all jobs depending on it are done
abortStage is used when DAGScheduler is requested to handle a TaskSetFailed event, submit a stage, submit missing tasks of a stage, handle a TaskCompletion event.
","text":""},{"location":"scheduler/DAGScheduler/#checking-out-stage-dependency-on-given-stage","title":"Checking Out Stage Dependency on Given Stage
stageDependsOn compares two stages and returns whether the stage depends on target stage (i.e. true) or not (i.e. false).
NOTE: A stage A depends on stage B if B is among the ancestors of A.
Internally, stageDependsOn walks through the graph of RDDs of the input stage. For every RDD in the RDD's dependencies (using RDD.dependencies), stageDependsOn adds the RDD of a NarrowDependency to a stack of RDDs to visit, while for a ShuffleDependency it <ShuffleMapStage stages for a ShuffleDependency>> (for the dependency and the stage's first job id) and adds the map stage's RDD to the stack of RDDs to visit only if the map stage is not ready, i.e. not all of its partitions have shuffle outputs yet.
After all the RDDs of the input stage are visited, stageDependsOn checks if the target's RDD is among the RDDs of the stage, i.e. whether the stage depends on target stage.
stageDependsOn is used when DAGScheduler is requested to abort a stage.
","text":""},{"location":"scheduler/DAGScheduler/#submitting-waiting-child-stages-for-execution","title":"Submitting Waiting Child Stages for Execution
submitWaitingChildStages submits for execution all waiting stages for which the input parent Stage.md[Stage] is the direct parent.
NOTE: Waiting stages are the stages registered in waitingStages internal registry.
When executed, you should see the following TRACE messages in the logs:
Checking if any dependencies of [parent] are now runnable
running: [runningStages]
waiting: [waitingStages]
failed: [failedStages]
submitWaitingChildStages finds child stages of the input parent stage, removes them from waitingStages internal registry, and <> one by one sorted by their job ids.
submitWaitingChildStages is used when DAGScheduler is requested to submit missing tasks of a stage and handle a successful ShuffleMapTask completion.
","text":""},{"location":"scheduler/DAGScheduler/#submitting-stage-with-missing-parents-for-execution","title":"Submitting Stage (with Missing Parents) for Execution
submitStage(
  stage: Stage): Unit
submitStage submits the input stage or its missing parents (if there are any stages that have not been computed yet and that the input stage depends on).
NOTE: submitStage is also used to DAGSchedulerEventProcessLoop.md#resubmitFailedStages[resubmit failed stages].
submitStage recursively submits any missing parents of the stage.
Internally, submitStage first finds the earliest-created job id that needs the stage.
NOTE: A stage itself tracks the jobs (their ids) it belongs to (using the internal jobIds registry).
The following steps depend on whether there is a job or not.
If there are no jobs that require the stage, submitStage <> with the reason:
No active job for stage [id]
If however there is a job for the stage, you should see the following DEBUG message in the logs:
submitStage([stage])
submitStage checks the status of the stage and continues when it was not recorded in waiting, running or failed internal registries. It simply exits otherwise.
With the stage ready for submission, submitStage calculates the <stage>> (sorted by their job ids). You should see the following DEBUG message in the logs:
missing: [missing]
When the stage has no parent stages missing, you should see the following INFO message in the logs:
Submitting [stage] ([stage.rdd]), which has no missing parents
submitStage <stage>> (with the earliest-created job id) and finishes.
If however there are missing parent stages for the stage, submitStage <>, and the stage is recorded in the internal waitingStages registry.
submitStage is used recursively for missing parents of the given stage and when DAGScheduler is requested for the following:
resubmitFailedStages (ResubmitFailedStages event)
submitWaitingChildStages (CompletionEvent event)
Handle JobSubmitted, MapStageSubmitted and TaskCompletion events
A single stage can be re-executed in multiple attempts due to fault recovery. The number of attempts is configured (FIXME).
If TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a DAGSchedulerEventProcessLoop.md#handleTaskCompletion-FetchFailed[CompletionEvent with FetchFailed], or an <> event. DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks.
Please note that tasks from the old attempts of a stage could still be running.
A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.
The latest StageInfo for the most recent attempt for a stage is accessible through latestInfo.
DAGScheduler computes where to run each task in a stage based on the rdd/index.md#getPreferredLocations[preferred locations of its underlying RDDs], or <>.
== Adaptive Query Planning / Adaptive Scheduling
See SPARK-9850 Adaptive execution in Spark for the design document. The work is currently in progress.
DAGScheduler.submitMapStage method is used for adaptive query planning, to run map stages and look at statistics about their outputs before submitting downstream stages.
DAGScheduler uses the following ScheduledThreadPoolExecutors (with the policy of removing cancelled tasks from a work queue at time of cancellation):
dag-scheduler-message - a daemon thread pool using j.u.c.ScheduledThreadPoolExecutor with core pool size 1. It is used to post a DAGSchedulerEventProcessLoop.md#ResubmitFailedStages[ResubmitFailedStages] event when DAGSchedulerEventProcessLoop.md#handleTaskCompletion-FetchFailed[FetchFailed is reported].
It is created using the ThreadUtils.newDaemonSingleThreadScheduledExecutor method that uses a Guava ThreadFactoryBuilder to instantiate the ThreadFactory.
","text":""},{"location":"scheduler/DAGScheduler/#finding-missing-parent-shufflemapstages-for-stage","title":"Finding Missing Parent ShuffleMapStages For Stage
getMissingParentStages finds missing parent ShuffleMapStages in the dependency graph of the input stage (using the breadth-first search algorithm).
Internally, getMissingParentStages starts with the stage's RDD and walks up the tree of all parent RDDs to find <>.
NOTE: A Stage tracks the associated RDD using Stage.md#rdd[rdd property].
NOTE: An uncached partition of a RDD is a partition that has Nil in the <> (which results in no RDD blocks in any of the active storage:BlockManager.md[BlockManager]s on executors).
getMissingParentStages traverses the rdd/index.md#dependencies[parent dependencies of the RDD] and acts according to their type, i.e. ShuffleDependency or NarrowDependency.
NOTE: ShuffleDependency and NarrowDependency are the main top-level Dependencies.
For each NarrowDependency, getMissingParentStages simply marks the corresponding RDD to visit and moves on to a next dependency of a RDD or works on another unvisited parent RDD.
NOTE: NarrowDependency is a RDD dependency that allows for pipelined execution.
getMissingParentStages focuses on ShuffleDependency dependencies.
NOTE: ShuffleDependency is a RDD dependency that represents a dependency on the output of a ShuffleMapStage, i.e. shuffle map stage.
For each ShuffleDependency, getMissingParentStages <ShuffleMapStage stages>>. If the ShuffleMapStage is not available, it is added to the set of missing (map) stages.
NOTE: A ShuffleMapStage is available when all its partitions are computed, i.e. results are available (as blocks).
CAUTION: FIXME...IMAGE with ShuffleDependencies queried
getMissingParentStages is used when DAGScheduler is requested to submit a stage and handle JobSubmitted and MapStageSubmitted events.
","text":""},{"location":"scheduler/DAGScheduler/#submitting-missing-tasks-of-stage","title":"Submitting Missing Tasks of Stage
submitMissingTasks prints out the following DEBUG message to the logs:
submitMissingTasks([stage])
submitMissingTasks requests the given Stage for the missing partitions (partitions that need to be computed).
submitMissingTasks adds the stage to the runningStages internal registry.
submitMissingTasks notifies the OutputCommitCoordinator that stage execution started.
submitMissingTasks determines preferred locations (task locality preferences) of the missing partitions.
submitMissingTasks requests the stage for a new stage attempt.
submitMissingTasks requests the LiveListenerBus to post a SparkListenerStageSubmitted event.
submitMissingTasks uses the closure Serializer to serialize the stage and create a so-called task binary. submitMissingTasks serializes the RDD (of the stage) and either the ShuffleDependency or the compute function based on the type of the stage (ShuffleMapStage or ResultStage, respectively).
submitMissingTasks creates a broadcast variable for the task binary.
Note
That shows how important broadcast variables are for Spark itself to distribute data among executors in a Spark application in the most efficient way.
submitMissingTasks creates tasks for every missing partition:
ShuffleMapTasks for a ShuffleMapStage
ResultTasks for a ResultStage
If there are tasks to submit for execution (i.e. there are missing partitions in the stage), submitMissingTasks prints out the following INFO message to the logs:
Submitting [size] missing tasks from [stage] ([rdd]) (first 15 tasks are for partitions [partitionIds])
submitMissingTasks requests the <> to TaskScheduler.md#submitTasks[submit the tasks for execution] (as a new TaskSet.md[TaskSet]).
With no tasks to submit for execution, submitMissingTasks <>.
submitMissingTasks prints out the following DEBUG messages based on the type of the stage:
Stage [stage] is actually done; (available: [isAvailable],available outputs: [numAvailableOutputs],partitions: [numPartitions])
or
Stage [stage] is actually done; (partitions: [numPartitions])
for ShuffleMapStage and ResultStage, respectively.
In the end, with no tasks to submit for execution, submitMissingTasks <> and exits.
submitMissingTasks is used when DAGScheduler is requested to submit a stage for execution.
","text":""},{"location":"scheduler/DAGScheduler/#finding-preferred-locations-for-missing-partitions","title":"Finding Preferred Locations for Missing Partitions
getCacheLocs gives TaskLocations (block locations) for the partitions of the input rdd. getCacheLocs caches lookup results in <> internal registry.
NOTE: The size of the collection from getCacheLocs is exactly the number of partitions in rdd RDD.
NOTE: The size of every TaskLocation collection (i.e. every entry in the result of getCacheLocs) is exactly the number of blocks managed using storage:BlockManager.md[BlockManagers] on executors.
Internally, getCacheLocs finds rdd in the <> internal registry (of partition locations per RDD).
If rdd is not in <> internal registry, getCacheLocs branches per its storage:StorageLevel.md[storage level].
For NONE storage level (i.e. no caching), the result is an empty locations (i.e. no location preference).
For other non-NONE storage levels, getCacheLocs storage:BlockManagerMaster.md#getLocations-block-array[requests BlockManagerMaster for block locations] that are then mapped to TaskLocations with the hostname of the owning BlockManager for a block (of a partition) and the executor id.
NOTE: getCacheLocs uses <> that was defined when <>.
getCacheLocs records the computed block locations per partition (as TaskLocation) in <> internal registry.
NOTE: getCacheLocs requests locations from BlockManagerMaster using storage:BlockId.md#RDDBlockId[RDDBlockId] with the RDD id and the partition indices (which implies that the order of the partitions matters to request proper blocks).
NOTE: DAGScheduler uses TaskLocation.md[TaskLocations] (with host and executor) while storage:BlockManagerMaster.md[BlockManagerMaster] uses storage:BlockManagerId.md[] (to track similar information, i.e. block locations).
getCacheLocs is used when DAGScheduler is requested to find missing parent MapStages and getPreferredLocsInternal.
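A hedged illustration (assuming a SparkContext sc): once an RDD is persisted and materialized, its partitions live as RDD blocks in executors' BlockManagers, and later jobs over the same RDD can use those block locations as placement preferences.

[source,scala]
----
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000, 8).persist(StorageLevel.MEMORY_ONLY)
data.count()              // materializes the RDD blocks on executors
data.map(_ * 2).count()   // the follow-up job can prefer the cached block locations
----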
","text":""},{"location":"scheduler/DAGScheduler/#finding-placement-preferences-for-rdd-partition-recursively","title":"Finding Placement Preferences for RDD Partition (recursively)
getPreferredLocsInternal first <TaskLocations for the partition of the rdd>> (using <> internal cache) and returns them.
Otherwise, if not found, getPreferredLocsInternal rdd/index.md#preferredLocations[requests rdd for the preferred locations of partition] and returns them.
NOTE: Preferred locations of the partitions of a RDD are also called placement preferences or locality preferences.
Otherwise, if not found, getPreferredLocsInternal finds the first parent NarrowDependency and (recursively) finds TaskLocations.
If all the attempts fail to yield any non-empty result, getPreferredLocsInternal returns an empty collection of TaskLocation.md[TaskLocations].
getPreferredLocsInternal is used when DAGScheduler is requested for the preferred locations for missing partitions.
removeExecutorAndUnregisterOutputs is used when DAGScheduler is requested to handle <> (due to a fetch failure) and <> events.
executorHeartbeatReceived posts a SparkListenerExecutorMetricsUpdate (to listenerBus) and informs BlockManagerMaster that blockManagerId block manager is alive (by posting BlockManagerHeartbeat).
executorHeartbeatReceived is used when TaskSchedulerImpl is requested to handle an executor heartbeat.
handleTaskCompletion branches off per TaskEndReason (as event.reason).
Success - acts according to the type of the task that completed, i.e. ShuffleMapTask or ResultTask
Resubmitted
Other TaskEndReasons
== Handling Successful Task Completion
When a task has finished successfully (i.e. Success end reason), handleTaskCompletion marks the partition as no longer pending (i.e. the partition the task worked on is removed from pendingPartitions of the stage).
NOTE: A Stage tracks its own pending partitions using scheduler:Stage.md#pendingPartitions[pendingPartitions property].
handleTaskCompletion branches off given the type of the task that completed, i.e. <> and <>.
== Handling Successful ResultTask Completion
For scheduler:ResultTask.md[ResultTask], the stage is assumed a scheduler:ResultStage.md[ResultStage].
handleTaskCompletion finds the ActiveJob associated with the ResultStage.
NOTE: scheduler:ResultStage.md[ResultStage] tracks the optional ActiveJob as scheduler:ResultStage.md#activeJob[activeJob property]. There could only be one active job for a ResultStage.
If there is no job for the ResultStage, you should see the following INFO message in the logs:
Ignoring result from [task] because its job has finished
Otherwise, when the ResultStage has a ActiveJob, handleTaskCompletion checks the status of the partition output for the partition the ResultTask ran for.
NOTE: ActiveJob tracks task completions in finished property with flags for every partition in a stage. When the flag for a partition is enabled (i.e. true), it is assumed that the partition has been computed (and no results from any ResultTask are expected and hence simply ignored).
CAUTION: FIXME Describe why a partition could have more than one ResultTask running.
handleTaskCompletion ignores the CompletionEvent when the partition has already been marked as completed for the stage and simply exits.
The partition for the ActiveJob (of the ResultStage) is marked as computed and the number of computed partitions is increased.
NOTE: ActiveJob tracks what partitions have already been computed and their number.
If the ActiveJob has finished (when the number of partitions computed is exactly the number of partitions in a stage) handleTaskCompletion does the following (in order):
scheduler:DAGScheduler.md#cleanupStateForJobAndIndependentStages[Cleans up after ActiveJob and independent stages].
Announces the job completion application-wide (by posting a SparkListener.md#SparkListenerJobEnd[SparkListenerJobEnd] to scheduler:LiveListenerBus.md[]).
In the end, handleTaskCompletion notifies JobListener of the ActiveJob that the task succeeded.
NOTE: A task succeeded notification holds the output index and the result.
When the notification throws an exception (because it runs user code), handleTaskCompletion notifies JobListener about the failure (wrapping it inside a SparkDriverExecutionException exception).
The task's result is assumed scheduler:MapStatus.md[MapStatus] that knows the executor where the task has finished.
You should see the following DEBUG message in the logs:
ShuffleMapTask finished on [execId]
If the executor is registered in scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry] and the epoch of the completed task is not greater than that of the executor (as in failedEpoch registry), you should see the following INFO message in the logs:
Ignoring possibly bogus [task] completion from executor [executorId]
Otherwise, handleTaskCompletion scheduler:ShuffleMapStage.md#addOutputLoc[registers the MapStatus result for the partition with the stage] (of the completed task).
handleTaskCompletion does more processing only if the ShuffleMapStage is registered as still running (in scheduler:DAGScheduler.md#runningStages[runningStages internal registry]) and the scheduler:Stage.md#pendingPartitions[ShuffleMapStage stage has no pending partitions to compute].
The ShuffleMapStage is <>.
You should see the following INFO messages in the logs:
looking for newly runnable stages
running: [runningStages]
waiting: [waitingStages]
failed: [failedStages]
handleTaskCompletion scheduler:MapOutputTrackerMaster.md#registerMapOutputs[registers the shuffle map outputs of the ShuffleDependency with MapOutputTrackerMaster] (with the epoch incremented) and scheduler:DAGScheduler.md#clearCacheLocs[clears internal cache of the stage's RDD block locations].
NOTE: scheduler:MapOutputTrackerMaster.md[MapOutputTrackerMaster] is given when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].
If the scheduler:ShuffleMapStage.md#isAvailable[ShuffleMapStage stage is ready], all scheduler:ShuffleMapStage.md#mapStageJobs[active jobs of the stage] (aka map-stage jobs) are scheduler:DAGScheduler.md#markMapStageJobAsFinished[marked as finished] (with scheduler:MapOutputTrackerMaster.md#getStatistics[MapOutputStatistics from MapOutputTrackerMaster for the ShuffleDependency]).
NOTE: A ShuffleMapStage stage is ready (aka available) when all partitions have shuffle outputs, i.e. when their tasks have completed.
Eventually, handleTaskCompletion scheduler:DAGScheduler.md#submitWaitingChildStages[submits waiting child stages (of the ready ShuffleMapStage)].
If however the ShuffleMapStage is not ready, you should see the following INFO message in the logs:
Resubmitting [shuffleStage] ([shuffleStage.name]) because some of its tasks had failed: [missingPartitions]
In the end, handleTaskCompletion scheduler:DAGScheduler.md#submitStage[submits the ShuffleMapStage for execution].
When FetchFailed happens, stageIdToStage is used to access the failed stage (using task.stageId and the task is available in event in handleTaskCompletion(event: CompletionEvent)). shuffleToMapStage is used to access the map stage (using shuffleId).
If failedStage.latestInfo.attemptId != task.stageAttemptId, you should see the following INFO in the logs:
Ignoring fetch failure from [task] as it's from [failedStage] attempt [task.stageAttemptId] and there is a more recent attempt for that stage (attempt ID [failedStage.latestInfo.attemptId]) running
CAUTION: FIXME What does failedStage.latestInfo.attemptId != task.stageAttemptId mean?
And the case finishes. Otherwise, the case continues.
If the failed stage is in runningStages, the following INFO message shows in the logs:
Marking [failedStage] ([failedStage.name]) as failed due to a fetch failure from [mapStage] ([mapStage.name])
markStageAsFinished(failedStage, Some(failureMessage)) is called.
CAUTION: FIXME What does markStageAsFinished do?
If the failed stage is not in runningStages, the following DEBUG message shows in the logs:
Received fetch failure from [task], but its from [failedStage] which is no longer running
When disallowStageRetryForTest is set, abortStage(failedStage, "Fetch failure will not retry stage due to testing config", None) is called.
CAUTION: FIXME Describe disallowStageRetryForTest and abortStage.
If the scheduler:Stage.md#failedOnFetchAndShouldAbort[number of fetch failed attempts for the stage exceeds the allowed number], the scheduler:DAGScheduler.md#abortStage[failed stage is aborted] with the reason:
[failedStage] ([name]) has failed the maximum allowable number of times: 4. Most recent failure reason: [failureMessage]
If there are no failed stages reported (scheduler:DAGScheduler.md#failedStages[DAGScheduler.failedStages] is empty), the following INFO shows in the logs:
Resubmitting [mapStage] ([mapStage.name]) and [failedStage] ([failedStage.name]) due to fetch failure
And the following code is executed:
messageScheduler.schedule(
  new Runnable {
    override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
  }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
CAUTION: FIXME What does the above code do?
For all the cases, the failed stage and map stages are both added to the internal scheduler:DAGScheduler.md#failedStages[registry of failed stages].
If mapId (in the FetchFailed object for the case) is provided, the map stage output is cleaned up (as it is broken) using mapStage.removeOutputLoc(mapId, bmAddress) and scheduler:MapOutputTracker.md#unregisterMapOutput[MapOutputTrackerMaster.unregisterMapOutput(shuffleId, mapId, bmAddress)] methods.
CAUTION: FIXME What does mapStage.removeOutputLoc do?
If BlockManagerId (as bmAddress in the FetchFailed object) is defined, handleTaskCompletion <> (with filesLost enabled and maybeEpoch from the scheduler:Task.md#epoch[Task] that completed).
handleTaskCompletion is used when:
DAGSchedulerEventProcessLoop is requested to handle a CompletionEvent event.
handleExecutorLost checks whether the input optional maybeEpoch is defined and if not requests the scheduler:MapOutputTracker.md#getEpoch[current epoch from MapOutputTrackerMaster].
NOTE: MapOutputTrackerMaster is passed in (as mapOutputTracker) when scheduler:DAGScheduler.md#creating-instance[DAGScheduler is created].
Recurring ExecutorLost events lead to the following repeating DEBUG message in the logs:
DEBUG Additional executor lost message for [execId] (epoch [currentEpoch])
NOTE: handleExecutorLost handler uses DAGScheduler's failedEpoch and FIXME internal registries.
Otherwise, when the executor execId is not in the scheduler:DAGScheduler.md#failedEpoch[list of executor lost] or the executor failure's epoch is smaller than the input maybeEpoch, the executor's lost event is recorded in scheduler:DAGScheduler.md#failedEpoch[failedEpoch internal registry].
CAUTION: FIXME Describe the case above in simpler non-technical words. Perhaps change the order, too.
You should see the following INFO message in the logs:
INFO Executor lost: [execId] (epoch [epoch])
storage:BlockManagerMaster.md#removeExecutor[BlockManagerMaster is requested to remove the lost executor execId].
CAUTION: FIXME Review what's filesLost.
handleExecutorLost exits unless the ExecutorLost event indicated that shuffle files were lost (the input filesLost is true) or the external shuffle service is not used.
In such a case, you should see the following INFO message in the logs:
Shuffle files lost for executor: [execId] (epoch [epoch])
handleExecutorLost walks over all scheduler:ShuffleMapStage.md[ShuffleMapStage]s in scheduler:DAGScheduler.md#shuffleToMapStage[DAGScheduler's shuffleToMapStage internal registry] and do the following (in order):
ShuffleMapStage.removeOutputsOnExecutor(execId) is called
scheduler:MapOutputTrackerMaster.md#registerMapOutputs[MapOutputTrackerMaster.registerMapOutputs(shuffleId, stage.outputLocInMapOutputTrackerFormat(), changeEpoch = true)] is called.
In case scheduler:DAGScheduler.md#shuffleToMapStage[DAGScheduler's shuffleToMapStage internal registry] has no shuffles registered, scheduler:MapOutputTrackerMaster.md#incrementEpoch[MapOutputTrackerMaster is requested to increment epoch].
Ultimately, DAGScheduler scheduler:DAGScheduler.md#clearCacheLocs[clears the internal cache of RDD partition locations].
handleExecutorLost is used when DAGSchedulerEventProcessLoop is requested to handle an ExecutorLost event.
handleJobCancellation looks up the active job for the input job ID (in jobIdToActiveJob internal registry) and fails it and all associated independent stages with failure reason:
Job [jobId] cancelled [reason]
When the input job ID is not found, handleJobCancellation prints out the following DEBUG message to the logs:
Trying to cancel unregistered job [jobId]
handleJobCancellation is used when DAGScheduler is requested to handle a JobCancelled event, doCancelAllJobs, handleJobGroupCancelled, handleStageCancellation.
handleJobGroupCancelled finds active jobs in a group and cancels them.
Internally, handleJobGroupCancelled computes all the active jobs (registered in the internal collection of active jobs) that have spark.jobGroup.id scheduling property set to groupId.
handleJobGroupCancelled then cancels every active job in the group one by one and the cancellation reason:
part of cancelled job group [groupId]
handleJobGroupCancelled is used when DAGScheduler is requested to handle JobGroupCancelled event.
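A hedged usage sketch: jobs submitted after setJobGroup carry the group id in their scheduling properties, and cancelJobGroup triggers the event that ends up in handleJobGroupCancelled.

[source,scala]
----
// assuming sc is an existing SparkContext; the group id is made up
sc.setJobGroup("nightly-report", "All jobs of the nightly report", interruptOnCancel = true)
// ... actions submitted here become ActiveJobs tagged with the group id ...
sc.cancelJobGroup("nightly-report")
----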
handleJobSubmitted creates a ResultStage (finalStage) for the given RDD, func, partitions, jobId and callSite.
BarrierJobSlotsNumberCheckFailed Exception
Creating a ResultStage may fail with a BarrierJobSlotsNumberCheckFailed exception.
handleJobSubmitted removes the given jobId from the barrierJobIdToNumTasksCheckFailures.
handleJobSubmitted creates an ActiveJob for the ResultStage (with the given jobId, the callSite, the JobListener and the properties).
handleJobSubmitted clears the internal cache of RDD partition locations.
FIXME Why is this clearing here so important?
handleJobSubmitted prints out the following INFO messages to the logs (with missingParentStages):
Got job [id] ([callSite]) with [number] output partitions
Final stage: [finalStage] ([name])
Parents of final stage: [parents]
Missing parents: [missingParentStages]
handleJobSubmitted registers the new ActiveJob in jobIdToActiveJob and activeJobs internal registries.
handleJobSubmitted requests the ResultStage to associate itself with the ActiveJob.
handleJobSubmitted uses the jobIdToStageIds internal registry to find all registered stages for the given jobId. handleJobSubmitted uses the stageIdToStage internal registry to request the Stages for the latestInfo.
In the end, handleJobSubmitted posts a SparkListenerJobStart message to the LiveListenerBus and submits the ResultStage.
handleJobSubmitted is used when:
DAGSchedulerEventProcessLoop is requested to handle a JobSubmitted event
In case of a BarrierJobSlotsNumberCheckFailed exception while creating a ResultStage, handleJobSubmitted increments the number of failures in the barrierJobIdToNumTasksCheckFailures for the given jobId.
handleJobSubmitted prints out the following WARN message to the logs (with spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures):
Barrier stage in job [jobId] requires [requiredConcurrentTasks] slots, but only [maxConcurrentTasks] are available. Will retry up to [maxFailures] more times
If the number of failures is below the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted requests the messageScheduler to schedule a one-shot task that requests the DAGSchedulerEventProcessLoop to post a JobSubmitted event (after spark.scheduler.barrier.maxConcurrentTasksCheck.interval seconds).
Note
Posting a JobSubmitted event is to request the DAGScheduler to re-consider the request, hoping that there will be enough resources to fulfill the resource requirements of a barrier job.
Otherwise, if the number of failures crossed the spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures threshold, handleJobSubmitted removes the jobId from the barrierJobIdToNumTasksCheckFailures and informs the given JobListener that the job failed (jobFailed).
resubmitFailedStages iterates over the internal collection of failed stages and submits them.
Note
resubmitFailedStages does nothing when there are no failed stages reported.
resubmitFailedStages prints out the following INFO message to the logs:
Resubmitting failed stages
resubmitFailedStages clears the internal cache of RDD partition locations and makes a copy of the collection of failed stages to track failed stages afresh.
Note
At this point DAGScheduler has no failed stages reported.
The previously-reported failed stages are sorted by the corresponding job ids in incremental order and resubmitted.
resubmitFailedStages is used when DAGSchedulerEventProcessLoop is requested to handle a ResubmitFailedStages event.
Stages that failed due to fetch failures (when a DAGSchedulerEventProcessLoop.md#handleTaskCompletion-FetchFailed[task fails with FetchFailed exception]).
Used when DAGScheduler creates a <> and a <>. It is the key in stageIdToStage.
== runningStages
The set of stages that are currently "running".
A stage is added when <> gets executed (without first checking if the stage has not already been added).
== shuffleIdToMapStage
A lookup table of ShuffleMapStages by ShuffleDependency
Used when DAGScheduler creates a shuffle map stage, creates a result stage, cleans up job state and independent stages, is informed that a task is started, a taskset has failed, a job is submitted (to compute a ResultStage), a map stage was submitted, a task has completed or a stage was cancelled, updates accumulators, aborts a stage and fails a job and independent stages.
Used when TaskSchedulerImpl is requested to handle a task status update (and a task gets lost which is used to indicate that the executor got broken and hence should be considered lost) or executorLost
updateAccumulators merges the partial values of accumulators from a completed task (based on the given CompletionEvent) into their \"source\" accumulators on the driver.
For every AccumulatorV2 update (in the given CompletionEvent), updateAccumulators finds the corresponding accumulator on the driver and requests the AccumulatorV2 to merge the updates.
updateAccumulators...FIXME
For named accumulators with the update value being a non-zero value, i.e. not Accumulable.zero:
stage.latestInfo.accumulables for the AccumulableInfo.id is set
CompletionEvent.taskInfo.accumulables has a new AccumulableInfo added.
CAUTION: FIXME Where are Stage.latestInfo.accumulables and CompletionEvent.taskInfo.accumulables used?
updateAccumulators is used when DAGScheduler is requested to handle a task completion.
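A hedged illustration (assuming a SparkContext sc): partial values added by tasks are merged into the driver-side "source" accumulator once task completions are handled.

[source,scala]
----
val errors = sc.longAccumulator("parse-errors")

sc.parallelize(Seq("1", "x", "3")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) errors.add(1)  // partial value added on an executor
}

assert(errors.value == 1L)  // merged into the driver-side accumulator
----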
Unless the given RDD is a barrier RDD (isBarrier), checkBarrierStageWithNumSlots does nothing (is a noop).
checkBarrierStageWithNumSlots requests the given RDD for the number of partitions.
checkBarrierStageWithNumSlots requests the SparkContext for the maximum number of concurrent tasks for the given ResourceProfile.
If the number of partitions (based on the RDD) is greater than the maximum number of concurrent tasks (based on the ResourceProfile), checkBarrierStageWithNumSlots reports a BarrierJobSlotsNumberCheckFailed exception.
checkBarrierStageWithNumSlots is used when:
DAGScheduler is requested to create a ShuffleMapStage or a ResultStage stage
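A hedged example of a barrier stage (assuming a SparkContext sc with enough slots, e.g. local[4]): all four tasks must be launched together, which is why the number of partitions must not exceed the number of available slots.

[source,scala]
----
import org.apache.spark.BarrierTaskContext

val doubled = sc.parallelize(1 to 100, 4)
  .barrier()
  .mapPartitions { it =>
    BarrierTaskContext.get().barrier()  // global synchronization point across all tasks
    it.map(_ * 2)
  }
  .collect()
----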
The section includes (hides) utility methods that do not really contribute to the understanding of how DAGScheduler works internally.
It's very likely they should not even be part of this page.
","text":""},{"location":"scheduler/DAGScheduler/#getShuffleDependenciesAndResourceProfiles","title":"Finding Shuffle Dependencies and ResourceProfiles of RDD
getShuffleDependenciesAndResourceProfiles returns the direct ShuffleDependencies and all the ResourceProfiles of the given RDD and parent non-shuffle RDDs, if available.
getShuffleDependenciesAndResourceProfiles collects ResourceProfiles of the given RDD and any parent RDDs, if available.
getShuffleDependenciesAndResourceProfiles collects direct ShuffleDependencies of the given RDD and any parent RDDs of non-ShuffleDependencies, if available.
getShuffleDependenciesAndResourceProfiles is used when:
DAGScheduler is requested to create a ShuffleMapStage and a ResultStage, and for the missing ShuffleDependencies of a RDD
DriverEndpoint is a ThreadSafeRpcEndpoint that is a message handler for CoarseGrainedSchedulerBackend to communicate with CoarseGrainedExecutorBackend.
DriverEndpoint is registered under the name CoarseGrainedScheduler by CoarseGrainedSchedulerBackend.
DriverEndpoint uses executorDataMap internal registry of all the executors that registered with the driver. An executor sends a RegisterExecutor message to inform that it wants to register.
onStart requests the Revive Messages Scheduler Service to schedule a periodic action that sends ReviveOffers messages every revive interval (based on spark.scheduler.revive.interval configuration property).
There are two makeOffers methods to launch tasks that differ by the number of active executor (from the executorDataMap registry) they work with:
All Active Executors
Single Executor
","text":""},{"location":"scheduler/DriverEndpoint/#on-all-active-executors","title":"On All Active Executors","text":"
makeOffers(): Unit
makeOffers builds WorkerOffers for every active executor (in the executorDataMap registry) and requests the TaskSchedulerImpl to generate tasks for the available worker offers (that creates TaskDescriptions).
With tasks (TaskDescriptions) to be launched, makeOffers launches them.
makeOffers is used when:
DriverEndpoint handles ReviveOffers messages
"},{"location":"scheduler/DriverEndpoint/#on-single-executor","title":"On Single Executor","text":"
makeOffers(
  executorId: String): Unit
Note
makeOffers with a single executor is makeOffers for all active executors for just one executor.
makeOffers is used when:
DriverEndpoint handles StatusUpdate and LaunchedExecutor messages
The input tasks collection contains one or more TaskDescriptions per executor (and the "task partitioning" per executor is of no use in launchTasks so it simply flattens the input data structure).
For every TaskDescription (in the given tasks collection), launchTasks encodes it and makes sure that the encoded task size is below the allowed message size.
launchTasks looks up the ExecutorData of the executor that has been assigned to execute the task (in executorDataMap internal registry) and decreases the executor's free cores (based on spark.task.cpus configuration property).
Note
Scheduling in Spark relies on cores only (not memory), i.e. the number of tasks Spark can run on an executor is limited by the number of cores available only. When submitting a Spark application for execution both executor resources -- memory and cores -- can however be specified explicitly. It is the job of a cluster manager to monitor the memory and take action when its use exceeds what was assigned.
launchTasks prints out the following DEBUG message to the logs:
Launching task [taskId] on executor id: [executorId] hostname: [executorHost].
In the end, launchTasks sends the (serialized) task to the executor (by sending a LaunchTask message to the executor's RPC endpoint with the serialized task inside a SerializableBuffer).
Note
This is the moment in a task's lifecycle when the driver sends the serialized task to an assigned executor.
In case the size of a serialized TaskDescription equals or exceeds the maximum allowed RPC message size, launchTasks looks up the TaskSetManager for the TaskDescription (in taskIdToTaskSetManager registry) and aborts it with the following message:
Serialized task [id]:[index] was [limit] bytes, which exceeds max allowed: spark.rpc.message.maxSize ([maxRpcMessageSize] bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
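A hedged sketch of the size check described above, using stand-in callbacks for aborting a task set and sending to an executor (names are illustrative, not Spark's API):

```scala
import java.nio.ByteBuffer

// Serialized tasks at or above the maximum RPC message size abort the owning
// task set instead of being sent; smaller tasks are sent to the executor.
def dispatch(
    serializedTasks: Seq[(Long, ByteBuffer)],   // (taskId, encoded TaskDescription)
    maxRpcMessageSize: Int,
    abortTaskSet: (Long, String) => Unit,
    sendToExecutor: (Long, ByteBuffer) => Unit): Unit =
  serializedTasks.foreach { case (taskId, buf) =>
    if (buf.limit() >= maxRpcMessageSize)
      abortTaskSet(taskId,
        s"Serialized task $taskId was ${buf.limit()} bytes, which exceeds " +
          s"max allowed: spark.rpc.message.maxSize ($maxRpcMessageSize bytes).")
    else
      sendToExecutor(taskId, buf)               // LaunchTask message in Spark
  }
```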
RegisterExecutor is sent when CoarseGrainedExecutorBackend RPC Endpoint is requested to start.
When received, DriverEndpoint makes sure that no other executors were registered under the input executorId and that the input hostname is not blacklisted.
If the requirements hold, you should see the following INFO message in the logs:
Registered executor [executorRef] ([address]) with ID [executorId]
DriverEndpoint does the bookkeeping:
Registers executorId (in addressToExecutorId)
Adds cores (in totalCoreCount)
Increments totalRegisteredExecutors
Creates and registers ExecutorData for executorId (in executorDataMap)
Updates currentExecutorIdCounter if the input executorId is greater than the current value.
If numPendingExecutors is greater than 0, you should see the following DEBUG message in the logs and DriverEndpoint decrements numPendingExecutors.
Decremented number of pending executors ([numPendingExecutors] left)
DriverEndpoint sends RegisteredExecutor message back (that is to confirm that the executor was registered successfully).
DriverEndpoint replies true (to acknowledge the message).
DriverEndpoint then announces the new executor by posting SparkListenerExecutorAdded to LiveListenerBus.
In the end, DriverEndpoint makes executor resource offers (for launching tasks).
If however there was already another executor registered under the input executorId, DriverEndpoint sends RegisterExecutorFailed message back with the reason:
Duplicate executor ID: [executorId]
If however the input hostname is blacklisted, you should see the following INFO message in the logs:
Rejecting [executorId] as it has been blacklisted.
DriverEndpoint sends a RegisterExecutorFailed message back with a reason saying that the executor is blacklisted.
disableExecutor checks whether the executor is active:
If so, disableExecutor adds the executor to the executorsPendingLossReason registry
Otherwise, disableExecutor checks whether the executor is already in the executorsPendingToRemove registry
disableExecutor determines whether the executor should really be disabled (as active or registered in executorsPendingToRemove registry).
If the executor should be disabled, disableExecutor prints out the following INFO message to the logs and notifies the TaskSchedulerImpl that the executor is lost.
Disabling executor [executorId].
disableExecutor returns the indication whether the executor should have been disabled or not.
disableExecutor is used when:
KubernetesDriverEndpoint is requested to handle onDisconnected event
YarnDriverEndpoint is requested to handle onDisconnected event
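A simplified sketch of the decision described above, with mutable sets standing in for DriverEndpoint's internal registries (the helper and parameter names are assumptions for this example):

```scala
import scala.collection.mutable

// An active executor is moved to "pending loss reason"; otherwise it only
// counts as disabled if it was already pending removal.
def disableExecutor(
    executorId: String,
    activeExecutors: mutable.Set[String],
    executorsPendingLossReason: mutable.Set[String],
    executorsPendingToRemove: mutable.Set[String],
    notifyExecutorLost: String => Unit): Boolean = {
  val shouldDisable =
    if (activeExecutors.contains(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      executorsPendingToRemove.contains(executorId)
    }
  if (shouldDisable) {
    println(s"Disabling executor $executorId.")
    notifyExecutorLost(executorId)   // TaskSchedulerImpl.executorLost in Spark
  }
  shouldDisable
}
```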
ExternalClusterManager is an abstraction of pluggable cluster managers that can create a SchedulerBackend and TaskScheduler for a given master URL (when SparkContext is created).
Note
The support for pluggable cluster managers was introduced in SPARK-13904 Add support for pluggable cluster manager.
ExternalClusterManager can be registered using the java.util.ServiceLoader mechanism (with service markers under META-INF/services directory).

=== Contract

==== Checking Support for Master URL
canCreate(
  masterURL: String): Boolean
Checks whether this cluster manager instance can create scheduler components for a given master URL
Used when SparkContext is created (and requested for a cluster manager)
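A minimal sketch of what an implementation might look like. The real ExternalClusterManager trait is private[spark], so this uses a stand-in trait and a hypothetical mycluster:// master URL purely for illustration:

```scala
// Stand-in for the shape of the contract; a real implementation extends
// org.apache.spark.scheduler.ExternalClusterManager and also creates the
// TaskScheduler and SchedulerBackend for the claimed master URL.
trait ClusterManagerLike {
  def canCreate(masterURL: String): Boolean
}

// Claims (hypothetical) master URLs of the form mycluster://host:port
object MyClusterManager extends ClusterManagerLike {
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")
}
```

Such an implementation is then registered by listing its fully-qualified class name in a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file on the classpath.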
== FIFOSchedulableBuilder - SchedulableBuilder for FIFO Scheduling Mode
FIFOSchedulableBuilder is a SchedulableBuilder that holds a single spark-scheduler-Pool.md[Pool] (that is given when FIFOSchedulableBuilder is created).
NOTE: FIFOSchedulableBuilder is the scheduler:TaskSchedulerImpl.md#creating-instance[default SchedulableBuilder for TaskSchedulerImpl].
NOTE: When FIFOSchedulableBuilder is created, the TaskSchedulerImpl passes its own rootPool (a part of scheduler:TaskScheduler.md#contract[TaskScheduler Contract]).
FIFOSchedulableBuilder obeys the SchedulableBuilder contract as follows:
buildPools does nothing.
addTaskSetManager spark-scheduler-Pool.md#addSchedulable[passes the input Schedulable to the one and only rootPool Pool (using addSchedulable)] and completely disregards the properties of the Schedulable.
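A simplified sketch of that behaviour with stand-in types (SimplePool and Schedulable here are illustrative, not Spark's classes): buildPools is a no-op and addTaskSetManager appends to the single root pool, ignoring the properties.

```scala
import java.util.Properties
import scala.collection.mutable

trait Schedulable { def name: String }
final class SimplePool(val name: String) extends Schedulable {
  val queue = mutable.Buffer.empty[Schedulable]
  def addSchedulable(s: Schedulable): Unit = queue += s
}

final class FifoBuilderSketch(rootPool: SimplePool) {
  def buildPools(): Unit = ()                        // nothing to build for FIFO
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit =
    rootPool.addSchedulable(manager)                 // properties are disregarded
}
```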
FairSchedulableBuilder is a SchedulableBuilder that is created exclusively for scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] for FAIR scheduling mode (when configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FAIR).
[[creating-instance]] FairSchedulableBuilder takes the following to be created:
[[rootPool]] Root Pool
[[conf]] SparkConf.md[]
Once created, TaskSchedulerImpl requests the FairSchedulableBuilder to build pools.
[[DEFAULT_SCHEDULER_FILE]] FairSchedulableBuilder uses the pools defined in an allocation file that is assumed to be the value of the configuration-properties.md#spark.scheduler.allocation.file[spark.scheduler.allocation.file] configuration property or the default fairscheduler.xml (that is expected to be on the CLASSPATH).
TIP: Use conf/fairscheduler.xml.template as a template for the allocation file.
[[DEFAULT_POOL_NAME]] FairSchedulableBuilder always has the default pool defined (and registers it unless done in the allocation file).
[[FAIR_SCHEDULER_PROPERTIES]] [[spark.scheduler.pool]] FairSchedulableBuilder uses the spark.scheduler.pool local property for the name of the pool to use when requested to add a TaskSetManager (default: default).
Note
SparkContext.setLocalProperty lets you set local properties per thread to group jobs in logical groups, e.g. to allow FairSchedulableBuilder to use the spark.scheduler.pool property and to group jobs from different threads to be submitted for execution on a non-default pool.
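A minimal usage example (assuming a SparkContext sc and a pool named production defined in the allocation file):

```scala
sc.setLocalProperty("spark.scheduler.pool", "production")
// whatever is executed afterwards is submitted to production pool
```

Setting the property to null for the thread reverts subsequent jobs to the default pool.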
[[logging]] TIP: Enable ALL logging level for org.apache.spark.scheduler.FairSchedulableBuilder logger to see what happens inside.
NOTE: buildPools is part of the SchedulableBuilder contract to build a tree of Schedulable pools.
buildPools builds the pools from the allocation file, if available, and then registers the default pool (buildDefaultPool).
buildPools prints out the following INFO message to the logs when the configuration file (per the configuration-properties.md#spark.scheduler.allocation.file[spark.scheduler.allocation.file] configuration property) could be read:
Creating Fair Scheduler pools from [file]
buildPools prints out the following INFO message to the logs when the configuration-properties.md#spark.scheduler.allocation.file[spark.scheduler.allocation.file] configuration property was not used to define the configuration file and the <> is used instead:
Creating Fair Scheduler pools from default file: [DEFAULT_SCHEDULER_FILE]
When neither configuration-properties.md#spark.scheduler.allocation.file[spark.scheduler.allocation.file] configuration property nor the <> could be used, buildPools prints out the following WARN message to the logs:
Fair Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use fair scheduling, configure pools in [DEFAULT_SCHEDULER_FILE] or set spark.scheduler.allocation.file to a file that contains the configuration.
NOTE: addTaskSetManager is part of the SchedulableBuilder contract to register a new TaskSetManager with the root pool.
addTaskSetManager finds the pool by name (in the given Properties) under the spark.scheduler.pool property or defaults to the default pool if undefined.
addTaskSetManager then requests the root pool to find the Schedulable (pool) by that name.
Unless found, addTaskSetManager creates a new Pool with the default configuration (as if the default pool were used) and requests the root pool to register it. In the end, addTaskSetManager prints out the following WARN message to the logs:
A job was submitted with scheduler pool [poolName], which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain [poolName]. Created [poolName] with default configuration (schedulingMode: [mode], minShare: [minShare], weight: [weight])
addTaskSetManager then requests the pool (found or newly-created) to register the given TaskSetManager (addSchedulable).
In the end, addTaskSetManager prints out the following INFO message to the logs:
Added task set [name] tasks to pool [poolName]
=== [[buildDefaultPool]] Registering Default Pool -- buildDefaultPool Method
buildFairSchedulerPool(
  is: InputStream,
  fileName: String): Unit
buildFairSchedulerPool starts by loading the XML file from the given InputStream.
For every pool element, buildFairSchedulerPool creates a Pool with the following:
Pool name per name attribute
Scheduling mode per schedulingMode element (case-insensitive with FIFO as the default)
Initial minimum share per minShare element (default: 0)
Initial weight per weight element (default: 1)
In the end, buildFairSchedulerPool requests the root pool to register the new pool (addSchedulable), followed by the INFO message in the logs:
Created pool: [name], schedulingMode: [mode], minShare: [minShare], weight: [weight]
NOTE: buildFairSchedulerPool is used exclusively when FairSchedulableBuilder is requested to build pools.

== HighlyCompressedMapStatus
Used when DAGScheduler is requested to cleanUpAfterSchedulerStop, handleJobSubmitted, handleMapStageSubmitted, handleTaskCompletion or failJobAndIndependentStages
start starts AsyncEventQueues (from the queues internal registry).
In the end, start requests the given MetricsSystem to register the LiveListenerBusMetrics.
start is used when:
SparkContext is created

=== Posting Event to All Queues
post(
  event: SparkListenerEvent): Unit
post puts the input event onto the internal eventQueue queue and releases the internal eventLock semaphore. If the event placement was not successful (and it can happen since the queue is capped at 10000 events), the onDropEvent method is called.
The event publishing is only possible until the stopped flag is enabled.
onDropEvent is called when no further events can be added to the internal eventQueue queue (while posting a SparkListenerEvent event).
It simply prints out the following ERROR message to the logs and ensures that it happens only once.
Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
stop releases the internal eventLock semaphore and waits until listenerThread dies. It can only happen after all events were posted (and polling eventQueue gives nothing).
In the end, stop enables the stopped flag.

=== listenerThread for Event Polling
LiveListenerBus uses a single daemon SparkListenerBus thread that ensures that events are polled from the event queue only after the listener bus has been started and only one event at a time.

=== Registering Listener with Status Queue
addToSharedQueue adds the given SparkListenerInterface to shared queue.
addToSharedQueue is used when:
SparkContext is requested to register a SparkListener and register extra SparkListeners
ExecutionListenerBus (Spark Structured Streaming) is created

=== Registering Listener with executorManagement Queue
addToQueue finds the queue in the queues internal registry.
If found, addToQueue requests it to add the given listener
If not found, addToQueue creates a AsyncEventQueue (with the given name, the LiveListenerBusMetrics, and this LiveListenerBus) and requests it to add the given listener. The AsyncEventQueue is started and added to the queues internal registry.
addToQueue is used when:
LiveListenerBus is requested to addToSharedQueue, addToManagementQueue, addToStatusQueue, addToEventLogQueue
StreamingQueryListenerBus (Spark Structured Streaming) is created
trackerEndpoint is a RpcEndpointRef of the MapOutputTracker RPC endpoint.
trackerEndpoint is initialized (registered or looked up) when SparkEnv is created for the driver and executors.
trackerEndpoint is used to communicate (synchronously).
trackerEndpoint is cleared (null) when MapOutputTrackerMaster is requested to stop.

=== Deregistering Map Output Status Information of Shuffle Stage
unregisterShuffle(
  shuffleId: Int): Unit
Deregisters map output status information for the given shuffle stage
Used when:
ContextCleaner is requested for shuffle cleanup
BlockManagerSlaveEndpoint is requested to remove a shuffle
stop is used when SparkEnv is requested to stop (and stops all the services, incl. MapOutputTracker).

=== Converting MapStatuses To BlockManagerIds with ShuffleBlockIds and Their Sizes
convertMapStatuses iterates over the input statuses array (of MapStatus entries indexed by map id) and, for every status and partition, creates a pair of the BlockManagerId (of the MapStatus entry) and a ShuffleBlockId (with the input shuffleId, the map id, and the partition, ranging from the input startPartition until endPartition) together with the estimated size of the reduce block.
For any empty MapStatus, convertMapStatuses prints out the following ERROR message to the logs:
Missing an output location for shuffle [id]
And convertMapStatuses throws a MetadataFetchFailedException (with shuffleId, startPartition, and the above error message).
convertMapStatuses is used when:
MapOutputTrackerMaster is requested for the sizes of shuffle map outputs by executor and range
MapOutputTrackerWorker is requested for the sizes of shuffle map outputs by executor and range

=== Sending Blocking Messages To trackerEndpoint RpcEndpointRef
askTracker[T](message: Any): T
askTracker sends the input message to trackerEndpoint RpcEndpointRef and waits for a result.
When an exception happens, askTracker prints out the following ERROR message to the logs and throws a SparkException.
Error communicating with MapOutputTracker
askTracker is used when MapOutputTracker is requested to fetch map outputs for a ShuffleDependency remotely and to send a one-way message.
serializeMapStatuses serializes the given array of map output locations into an efficient byte format (to send to reduce tasks). serializeMapStatuses compresses the serialized bytes using GZIP. They are supposed to be pretty compressible because many map outputs will be on the same hostname.
Internally, serializeMapStatuses creates a Java ByteArrayOutputStream.
serializeMapStatuses writes out 0 (direct) first.
serializeMapStatuses creates a Java GZIPOutputStream (with the ByteArrayOutputStream created) and writes out the given statuses array.
serializeMapStatuses decides whether to return the output array (of the output stream) or use a broadcast variable based on the size of the byte array.
If the size of the result byte array is the given minBroadcastSize threshold or bigger, serializeMapStatuses requests the input BroadcastManager to create a broadcast variable.
serializeMapStatuses resets the ByteArrayOutputStream and starts over.
serializeMapStatuses writes out 1 (broadcast) first.
serializeMapStatuses creates a new Java GZIPOutputStream (with the ByteArrayOutputStream created) and writes out the broadcast variable.
serializeMapStatuses prints out the following INFO message to the logs:
Broadcast mapstatuses size = [length], actual size = [length]
serializeMapStatuses is used when ShuffleStatus is requested to serialize shuffle map output statuses.
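A simplified sketch of the direct-vs-broadcast framing described above. Only the GZIP framing and the size threshold are shown; broadcasting is a caller-supplied stub and the types are stand-ins, not Spark's MapStatus:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

// Writes a one-byte header (0 = direct, 1 = broadcast) followed by a GZIP-compressed
// payload. If the direct form reaches minBroadcastSize, the payload is handed to a
// caller-supplied broadcast function and only the broadcast reference is serialized.
def serializeStatuses(
    statuses: Array[String],                              // stand-in for Array[MapStatus]
    minBroadcastSize: Int,
    broadcast: Array[Byte] => java.io.Serializable): Array[Byte] = {
  def gzip(header: Byte, payload: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    bytes.write(header)
    val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
    out.writeObject(payload)
    out.close()
    bytes.toByteArray
  }
  val direct = gzip(0, statuses)
  if (direct.length >= minBroadcastSize) gzip(1, broadcast(direct)) else direct
}
```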
MapOutputTrackerMaster is given a BroadcastManager when created.

=== Shuffle Map Output Status Registry
MapOutputTrackerMaster uses an internal registry of ShuffleStatuses by shuffle stages.
MapOutputTrackerMaster adds a new shuffle when requested to register one (when DAGScheduler is requested to create a ShuffleMapStage for a ShuffleDependency).
MapOutputTrackerMaster uses the registry when requested for the following:
MapOutputTrackerMaster uses the following configuration properties:
spark.shuffle.mapOutput.minSizeForBroadcast
spark.shuffle.mapOutput.dispatcher.numThreads
spark.shuffle.reduceLocality.enabled

=== Map and Reduce Task Thresholds for Preferred Locations
MapOutputTrackerMaster defines 1000 (tasks) as the hardcoded threshold of the number of map and reduce tasks when requested to compute preferred locations with spark.shuffle.reduceLocality.enabled.

=== Map Output Threshold for Preferred Location of Reduce Tasks
MapOutputTrackerMaster defines 0.2 as the fraction of the total map output that must be at a location for it to be considered as a preferred location for a reduce task.
Making this larger will focus on fewer locations where most data can be read locally, but may lead to more delay in scheduling if those locations are busy.
MapOutputTrackerMaster uses the fraction when requested for the preferred locations of shuffle RDDs.
MessageLoop is a thread of execution to handle GetMapOutputMessages until a PoisonPill marker message arrives (when MapOutputTrackerMaster is requested to stop).
MessageLoop takes a GetMapOutputMessage and prints out the following DEBUG message to the logs:
Handling request to send map output locations for shuffle [shuffleId] to [hostPort]
MessageLoop then finds the ShuffleStatus by the shuffle ID in the shuffleStatuses internal registry and replies back (to the RPC client) with a serialized map output status (with the BroadcastManager and spark.shuffle.mapOutput.minSizeForBroadcast configuration property).
MessageLoop threads run on the map-output-dispatcher Thread Pool.

=== map-output-dispatcher Thread Pool
threadpool: ThreadPoolExecutor
threadpool is a daemon fixed thread pool registered with map-output-dispatcher thread name prefix.
threadpool uses spark.shuffle.mapOutput.dispatcher.numThreads configuration property for the number of MessageLoop dispatcher threads to process received GetMapOutputMessage messages.
The dispatcher threads are started immediately when MapOutputTrackerMaster is created.
The thread pool is shut down when MapOutputTrackerMaster is requested to stop.

=== Epoch Number
MapOutputTrackerMaster uses an epoch number to...FIXME
getEpoch is used when:
DAGScheduler is requested to removeExecutorAndUnregisterOutputs
TaskSetManager is created (and sets the epoch to tasks)
getPreferredLocationsForShuffle computes the locations (BlockManagers) with the most shuffle map outputs for the input ShuffleDependency and Partition.
getPreferredLocationsForShuffle computes the locations when all of the following are met:
spark.shuffle.reduceLocality.enabled configuration property is enabled
The number of "map" partitions (of the RDD of the input ShuffleDependency) is below SHUFFLE_PREF_MAP_THRESHOLD
The number of "reduce" partitions (of the Partitioner of the input ShuffleDependency) is below SHUFFLE_PREF_REDUCE_THRESHOLD
Note
getPreferredLocationsForShuffle is simply getLocationsWithLargestOutputs with a guard condition.
Internally, getPreferredLocationsForShuffle checks whether spark.shuffle.reduceLocality.enabled configuration property is enabled with the number of partitions of the RDD of the input ShuffleDependency and partitions in the partitioner of the input ShuffleDependency both being less than 1000.
Note
The thresholds for the number of partitions in the RDD and of the partitioner when computing the preferred locations are 1000 and are not configurable.
If the condition holds, getPreferredLocationsForShuffle finds locations with the largest number of shuffle map outputs for the input ShuffleDependency and partitionId (with the number of partitions in the partitioner of the input ShuffleDependency and 0.2) and returns the hosts of the preferred BlockManagers.
Note
0.2 is the fraction of total map output that must be at a location to be considered as a preferred location for a reduce task. It is not configurable.
getPreferredLocationsForShuffle is used when ShuffledRDD and Spark SQL's ShuffledRowRDD are requested for preferred locations of a partition.

=== Finding Locations with Largest Number of Shuffle Map Outputs
getLocationsWithLargestOutputs returns BlockManagerIds with the largest size (of all the shuffle blocks they manage) above the input fractionThreshold (given the total size of all the shuffle blocks for the shuffle across all BlockManagers).
Note
getLocationsWithLargestOutputs may return no BlockManagerId if their shuffle blocks do not total up above the input fractionThreshold.
Note
The input numReducers is not used.
Internally, getLocationsWithLargestOutputs queries the mapStatuses internal cache for the input shuffleId.
Note
One entry in mapStatuses internal cache is a MapStatus array indexed by partition id.
MapStatus includes information about the BlockManager (as BlockManagerId) and estimated size of the reduce blocks.
getLocationsWithLargestOutputs iterates over the MapStatus array and builds an interim mapping between BlockManagerId and the cumulative sum of shuffle blocks across BlockManagers.
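A simplified sketch of the selection: sum the estimated reduce-block sizes per location and keep the locations whose share of the total exceeds the fraction threshold (the types are stand-ins, not Spark's MapStatus or BlockManagerId):

```scala
final case class Location(host: String, executorId: String)

def locationsWithLargestOutputs(
    sizesByLocation: Map[Location, Long],
    fractionThreshold: Double): Seq[Location] = {
  val total = sizesByLocation.values.sum.toDouble
  if (total == 0) Seq.empty
  else sizesByLocation.collect {
    case (loc, size) if size / total > fractionThreshold => loc
  }.toSeq
}
```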
incrementEpoch prints out the following DEBUG message to the logs:
Increasing epoch to [epoch]
incrementEpoch is used when:
MapOutputTrackerMaster is requested to unregisterMapOutput, unregisterAllMapOutput, removeOutputsOnHost and removeOutputsOnExecutor
DAGScheduler is requested to handle a ShuffleMapTask completion (of a ShuffleMapStage)

=== Checking Availability of Shuffle Map Output Status
containsShuffle(
  shuffleId: Int): Boolean
containsShuffle checks if the input shuffleId is registered in the cachedSerializedStatuses or mapStatuses internal caches.
containsShuffle is used when DAGScheduler is requested to create a ShuffleMapStage (createShuffleMapStage) for a ShuffleDependency.
registerShuffle registers a new ShuffleStatus (for the given shuffle ID and the number of partitions) to the shuffleStatuses internal registry.
registerShuffle throws an IllegalArgumentException when the shuffle ID has already been registered:
Shuffle ID [shuffleId] registered twice
registerShuffle is used when:
DAGScheduler is requested to create a ShuffleMapStage (for a ShuffleDependency)

=== Registering Map Outputs for Shuffle (Possibly with Epoch Change)
getSerializedMapOutputStatuses finds cached serialized map statuses for the input shuffleId.
If found, getSerializedMapOutputStatuses returns the cached serialized map statuses.
Otherwise, getSerializedMapOutputStatuses acquires the shuffle lock for shuffleId and finds the cached serialized map statuses again since some other thread could have updated the cachedSerializedStatuses internal cache in the meantime.
getSerializedMapOutputStatuses returns the serialized map statuses if found.
If not, getSerializedMapOutputStatuses serializes the local array of MapStatuses (from checkCachedStatuses).
getSerializedMapOutputStatuses prints out the following INFO message to the logs:
Size of output statuses for shuffle [shuffleId] is [bytes] bytes
getSerializedMapOutputStatuses saves the serialized map output statuses in cachedSerializedStatuses internal cache if the epoch has not changed in the meantime. getSerializedMapOutputStatuses also saves its broadcast version in cachedSerializedBroadcast internal cache.
If the epoch has changed in the meantime, the serialized map output statuses and their broadcast version are not saved, and getSerializedMapOutputStatuses prints out the following INFO message to the logs:
Epoch changed, not caching!
getSerializedMapOutputStatuses removes the broadcast.
getSerializedMapOutputStatuses returns the serialized map statuses.
getSerializedMapOutputStatuses is used when MapOutputTrackerMaster responds to GetMapOutputMessage requests and DAGScheduler creates ShuffleMapStage for ShuffleDependency (copying the shuffle map output locations from previous jobs to avoid unnecessarily regenerating data).
checkCachedStatuses is an internal helper method that getSerializedMapOutputStatuses uses to do some bookkeeping (when the epoch and cacheEpoch differ) and set the local statuses, retBytes and epochGotten values (that getSerializedMapOutputStatuses uses).
Internally, checkCachedStatuses acquires the MapOutputTracker.md#epochLock[epochLock lock] and compares the current epoch to cacheEpoch.
If the epoch is younger (i.e. greater), checkCachedStatuses clears the cachedSerializedStatuses internal cache and the cached serialized broadcast, and sets cacheEpoch to the epoch.
checkCachedStatuses then gets the serialized map output statuses for the shuffleId.
When the serialized map output status is found, checkCachedStatuses saves it in a local retBytes and returns true.
When not found, you should see the following DEBUG message in the logs:
cached status not found for : [shuffleId]
checkCachedStatuses uses the MapOutputTracker.md#mapStatuses[mapStatuses] internal cache to get the map output statuses for the shuffleId or falls back to an empty array and sets it to the local statuses. checkCachedStatuses sets the local epochGotten to the current epoch and returns false.

=== Registering Shuffle Map Output
registerMapOutput finds the ShuffleStatus by the given shuffle ID and adds the given MapStatus:
The given mapId is the partitionId of the ShuffleMapTask that finished.
The given shuffleId is the shuffleId of the ShuffleDependency of the ShuffleMapStage (for which the ShuffleMapTask completed)
registerMapOutput is used when DAGScheduler is requested to handle a ShuffleMapTask completion.

=== Map Output Statistics for ShuffleDependency
getStatistics requests the input ShuffleDependency for the shuffle ID and looks up the corresponding ShuffleStatus (in the shuffleStatuses registry).
getStatistics assumes that the ShuffleStatus is in shuffleStatuses registry.
getStatistics requests the ShuffleStatus for the MapStatuses (of the ShuffleDependency).
getStatistics uses the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property to decide on parallelism to calculate the statistics.
With no parallelism, getStatistics simply traverses over the MapStatuses and requests them (one by one) for the size of every shuffle block.
Note
getStatistics requests the given ShuffleDependency for the Partitioner that in turn is requested for the number of partitions.
The number of blocks is the number of MapStatuses multiplied by the number of partitions.
And hence the need for parallelism based on the spark.shuffle.mapOutput.parallelAggregationThreshold configuration property.
In the end, getStatistics creates a MapOutputStatistics with the shuffle ID (of the given ShuffleDependency) and the total sizes (summed up for every partition).
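A simplified sketch of the non-parallel branch: total up the estimated size of every reduce block across all map outputs (the size function stands in for MapStatus.getSizeForBlock):

```scala
def totalBytesByPartition(
    mapOutputs: Seq[Int => Long],   // one size function per map output (MapStatus)
    numPartitions: Int): Array[Long] = {
  val totalSizes = new Array[Long](numPartitions)
  for (getSizeForBlock <- mapOutputs; reduceId <- 0 until numPartitions)
    totalSizes(reduceId) += getSizeForBlock(reduceId)
  totalSizes
}
```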
getStatistics is used when:
DAGScheduler is requested to handle a successful ShuffleMapStage submission and markMapStageJobsAsFinished

=== Deregistering All Map Outputs of Shuffle Stage
unregisterAllMapOutput(
  shuffleId: Int): Unit
unregisterAllMapOutput...FIXME
unregisterAllMapOutput is used when DAGScheduler is requested to handle a task completion (due to a fetch failure).
removeOutputsOnExecutor is used when DAGScheduler is requested to removeExecutorAndUnregisterOutputs.

=== Number of Partitions with Shuffle Map Outputs Available
getNumAvailableOutputs(
  shuffleId: Int): Int
getNumAvailableOutputs...FIXME
getNumAvailableOutputs is used when ShuffleMapStage is requested for the number of partitions with shuffle outputs available.
Posted when MapOutputTrackerWorker is requested for shuffle map outputs for a given shuffle ID
When received, MapOutputTrackerMasterEndpoint prints out the following INFO message to the logs:
Asked to send map output locations for shuffle [shuffleId] to [hostPort]
In the end, MapOutputTrackerMasterEndpoint requests the MapOutputTrackerMaster to post a GetMapOutputMessage (with the input shuffleId). Whatever is returned from MapOutputTrackerMaster becomes the response.
MapOutputTrackerWorker is the MapOutputTracker for executors.
MapOutputTrackerWorker uses Java's thread-safe java.util.concurrent.ConcurrentHashMap for mapStatuses internal cache and any lookup cache miss triggers a fetch from the driver's MapOutputTrackerMaster.
getStatuses finds MapStatus.md[MapStatuses] for the input shuffleId in the mapStatuses internal cache and, when not available, fetches them from a remote MapOutputTrackerMaster.md[MapOutputTrackerMaster] (using RPC).
Internally, getStatuses first queries the mapStatuses internal cache and returns the map outputs if found.
If not found (in the mapStatuses internal cache), you should see the following INFO message in the logs:
Don't have map outputs for shuffle [id], fetching them
If some other process fetches the map outputs for the shuffleId (as recorded in fetching internal registry), getStatuses waits until it is done.
When no other process fetches the map outputs, getStatuses registers the input shuffleId in fetching internal registry (of shuffle map outputs being fetched).
You should see the following INFO message in the logs:
Doing the fetch; tracker endpoint = [trackerEndpoint]
getStatuses sends a GetMapOutputStatuses RPC remote message for the input shuffleId to the trackerEndpoint expecting a Array[Byte].
NOTE: getStatuses requests shuffle map outputs remotely within a timeout and with retries. Refer to rpc:RpcEndpointRef.md[RpcEndpointRef].
getStatuses deserializes the fetched map output statuses and records the result in the mapStatuses internal cache.
You should see the following INFO message in the logs:
Got the output locations
getStatuses removes the input shuffleId from fetching internal registry.
You should see the following DEBUG message in the logs:
Fetching map output statuses for shuffle [id] took [time] ms
If getStatuses could not find the map output locations for the input shuffleId (locally and remotely), you should see the following ERROR message in the logs and getStatuses throws a MetadataFetchFailedException.
Missing all output locations for shuffle [id]
NOTE: getStatuses is used when MapOutputTrackerWorker is requested for the sizes of shuffle map outputs (for a ShuffleDependency).
== [[logging]] Logging
Enable ALL logging level for org.apache.spark.MapOutputTrackerWorker logger to see what happens inside.
MapStatus is an abstraction of shuffle map output statuses with an estimated size, location and map Id.
MapStatus is a result of executing a ShuffleMapTask.
After a ShuffleMapTask has finished execution successfully, DAGScheduler is requested to handle a ShuffleMapTask completion that in turn requests the MapOutputTrackerMaster to register the MapStatus.
MapStatus utility uses spark.shuffle.minNumPartitionsToHighlyCompress internal configuration property for the minimum number of partitions to prefer a HighlyCompressedMapStatus.
apply creates a HighlyCompressedMapStatus when the number of uncompressedSizes is above minPartitionsToUseHighlyCompressMapStatus threshold. Otherwise, apply creates a CompressedMapStatus.
apply is used when:
SortShuffleWriter is requested to write records
BypassMergeSortShuffleWriter is requested to write records
UnsafeShuffleWriter is requested to close resources and write out merged spill files
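A sketch of the decision. The threshold corresponds to spark.shuffle.minNumPartitionsToHighlyCompress; the value 2000 below is only an assumption for the example, and the result types are stand-ins for the real MapStatus implementations:

```scala
sealed trait MapStatusKind
case object Compressed extends MapStatusKind
case object HighlyCompressed extends MapStatusKind

def chooseMapStatus(uncompressedSizes: Array[Long], minPartitions: Int = 2000): MapStatusKind =
  if (uncompressedSizes.length > minPartitions) HighlyCompressed else Compressed
```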
Pool is a scheduler:spark-scheduler-Schedulable.md[Schedulable] entity that represents a tree of scheduler:TaskSetManager.md[TaskSetManagers], i.e. it contains a collection of TaskSetManagers or the Pools thereof.
A Pool has a mandatory name, a spark-scheduler-SchedulingMode.md[scheduling mode], initial minShare and weight that are defined when it is created.
NOTE: An instance of Pool is created when scheduler:TaskSchedulerImpl.md#initialize[TaskSchedulerImpl is initialized].
NOTE: The scheduler:TaskScheduler.md#contract[TaskScheduler Contract] and spark-scheduler-Schedulable.md#contract[Schedulable Contract] both require that their entities have rootPool of type Pool.
Using the spark-scheduler-SchedulingMode.md[scheduling mode] (given when a Pool object is created), Pool selects the scheduling algorithm and sets taskSetSchedulingAlgorithm:
FIFOSchedulingAlgorithm for FIFO scheduling mode.
FairSchedulingAlgorithm for FAIR scheduling mode.
Pool throws an IllegalArgumentException when an unsupported scheduling mode is passed in.
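A stand-in sketch of the selection and the failure case (the algorithm classes mirror Spark's names but the types here are simplified):

```scala
object SchedulingMode extends Enumeration {
  val FAIR, FIFO, NONE = Value
}

trait SchedulingAlgorithmLike
final class FifoSchedulingAlgorithmSketch extends SchedulingAlgorithmLike
final class FairSchedulingAlgorithmSketch extends SchedulingAlgorithmLike

def selectAlgorithm(mode: SchedulingMode.Value): SchedulingAlgorithmLike = mode match {
  case SchedulingMode.FAIR => new FairSchedulingAlgorithmSketch
  case SchedulingMode.FIFO => new FifoSchedulingAlgorithmSketch
  case other => throw new IllegalArgumentException(s"Unsupported scheduling mode: $other")
}
```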
NOTE: getSortedTaskSetQueue is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract].
getSortedTaskSetQueue sorts all the spark-scheduler-Schedulable.md[Schedulables] in the spark-scheduler-Schedulable.md#contract[schedulableQueue] queue by the scheduling algorithm (from the internal taskSetSchedulingAlgorithm).
NOTE: It is called when scheduler:TaskSchedulerImpl.md#resourceOffers[TaskSchedulerImpl processes executor resource offers].
=== [[schedulableNameToSchedulable]] Schedulables by Name -- schedulableNameToSchedulable Registry

schedulableNameToSchedulable = new ConcurrentHashMap[String, Schedulable]
schedulableNameToSchedulable is a lookup table of spark-scheduler-Schedulable.md[Schedulable] objects by their names.
Beside the obvious usage in the housekeeping methods like addSchedulable, removeSchedulable, getSchedulableByName from the spark-scheduler-Schedulable.md#contract[Schedulable Contract], it is exclusively used in SparkContext.md#getPoolForName[SparkContext.getPoolForName].
=== [[addSchedulable]] addSchedulable Method
NOTE: addSchedulable is part of the spark-scheduler-Schedulable.md#contract[Schedulable Contract].
addSchedulable adds a Schedulable to the spark-scheduler-Schedulable.md#contract[schedulableQueue] and the schedulableNameToSchedulable registry.
More importantly, it sets the Schedulable entity's spark-scheduler-Schedulable.md#contract[parent] to itself.
ResultTask[T, U] is a Task that executes a partition processing function on a partition with records (of type T) to produce a result (of type U) that is sent back to the driver.
== [[SchedulableBuilder]] SchedulableBuilder Contract -- Builders of Schedulable Pools
SchedulableBuilder is the abstraction of schedulable builders that manage a root Pool (of Schedulables), i.e. build a tree of Schedulable pools and register new TaskSetManagers with the root pool.
SchedulableBuilder is a private[spark] Scala trait that is used exclusively by scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] (the default Spark scheduler). When requested to scheduler:TaskSchedulerImpl.md#initialize[initialize], TaskSchedulerImpl uses the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property (default: FIFO) to select one of the available SchedulableBuilders.
Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#submitTasks[submit tasks (of TaskSet) for execution] (and registers a new scheduler:TaskSetManager.md[TaskSetManager] for the TaskSet)
Used exclusively when TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize] (and creates a scheduler:TaskSchedulerImpl.md#schedulableBuilder[SchedulableBuilder] per configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property)
[[FairSchedulableBuilder]] FairSchedulableBuilder: used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FAIR
[[FIFOSchedulableBuilder]] FIFOSchedulableBuilder: the default SchedulableBuilder that is used when the configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] configuration property is FIFO (default)
SchedulerBackend is an abstraction of task scheduling backends that can revive resource offers from cluster managers.
The SchedulerBackend abstraction allows TaskSchedulerImpl to use a variety of cluster managers (with their own resource offers and task scheduling modes).
Note
Being a scheduler backend system assumes an Apache Mesos-like scheduling model in which "an application" gets resource offers as machines become available so it is possible to launch tasks on them. Once the required resource allocation is obtained, the scheduler backend can start executors.

== SchedulerBackendUtils Utility

=== Default Number of Executors
SchedulerBackendUtils defaults to 2 as the default number of executors.
getInitialTargetExecutorNumber(
  conf: SparkConf,
  numExecutors: Int = DEFAULT_NUMBER_EXECUTORS): Int
getInitialTargetExecutorNumber branches off based on whether Dynamic Allocation of Executors is enabled or not.
With no Dynamic Allocation of Executors, getInitialTargetExecutorNumber uses the spark.executor.instances configuration property (if defined) or uses the given numExecutors (and the DEFAULT_NUMBER_EXECUTORS).
With Dynamic Allocation of Executors enabled, getInitialTargetExecutorNumber uses getDynamicAllocationInitialExecutors and makes sure that the value is between the following configuration properties:
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.maxExecutors
getInitialTargetExecutorNumber is used when:
KubernetesClusterSchedulerBackend (Spark on Kubernetes) is created
Spark on YARN's YarnAllocator, YarnClientSchedulerBackend and YarnClusterSchedulerBackend are used
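A simplified sketch of the branching, with plain parameters standing in for the SparkConf lookups (the bounds check is shown as a require for brevity):

```scala
def initialTargetExecutorNumber(
    dynamicAllocationEnabled: Boolean,
    dynamicAllocationInitial: => Int,   // getDynamicAllocationInitialExecutors in Spark
    minExecutors: Int,                  // spark.dynamicAllocation.minExecutors
    maxExecutors: Int,                  // spark.dynamicAllocation.maxExecutors
    explicitInstances: Option[Int],     // spark.executor.instances, if set
    defaultNumExecutors: Int = 2): Int =
  if (dynamicAllocationEnabled) {
    val initial = dynamicAllocationInitial
    require(initial >= minExecutors && initial <= maxExecutors,
      s"initial executor number $initial must be between $minExecutors and $maxExecutors")
    initial
  } else {
    explicitInstances.getOrElse(defaultNumExecutors)
  }
```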
Scheduling Mode (aka order task policy or scheduling policy or scheduling order) defines a policy to sort tasks in order for execution.
The scheduling mode schedulingMode attribute is part of the scheduler:TaskScheduler.md#schedulingMode[TaskScheduler Contract].
The only implementation of the TaskScheduler contract in Spark -- scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] -- uses configuration-properties.md#spark.scheduler.mode[spark.scheduler.mode] setting to configure schedulingMode that is merely used to set up the scheduler:TaskScheduler.md#rootPool[rootPool] attribute (with FIFO being the default). It happens when scheduler:TaskSchedulerImpl.md#initialize[TaskSchedulerImpl is initialized].
There are three acceptable scheduling modes:
[[FIFO]] FIFO with no pools but a single top-level unnamed pool with elements being scheduler:TaskSetManager.md[TaskSetManager] objects; lower priority gets scheduler:spark-scheduler-Schedulable.md[Schedulable] sooner or earlier stage wins.
[[FAIR]] FAIR with a scheduler:spark-scheduler-FairSchedulableBuilder.md#buildPools[hierarchy of Schedulable (sub)pools] with the scheduler:TaskScheduler.md#rootPool[rootPool] at the top.
[[NONE]] NONE (not used)
NOTE: Out of three possible SchedulingMode policies only FIFO and FAIR modes are supported by scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl].
After the root pool is initialized, the scheduling mode is no longer relevant (since the spark-scheduler-Schedulable.md[Schedulable] that represents the root pool is fully set up).

The root pool is later used when scheduler:TaskSchedulerImpl.md#submitTasks[TaskSchedulerImpl submits tasks (as TaskSets) for execution].
NOTE: The scheduler:TaskScheduler.md#rootPool[root pool] is a Schedulable. Refer to spark-scheduler-Schedulable.md[Schedulable].
=== [[fair-scheduling-sparkui]] Monitoring FAIR Scheduling Mode using Spark UI
ShuffleMapStage (shuffle map stage or simply map stage) is a Stage.
ShuffleMapStage corresponds to (and is associated with) a ShuffleDependency.
ShuffleMapStage can be submitted independently but it is usually an intermediate step in a physical execution plan (with the final step being a ResultStage).
findMissingPartitions requests the MapOutputTrackerMaster for the missing partitions (of the ShuffleDependency) and returns them.
If not available (MapOutputTrackerMaster does not track the ShuffleDependency), findMissingPartitions simply assumes that all the partitions are missing.
findMissingPartitions is part of the Stage abstraction.
isAvailable is true when the ShuffleMapStage is ready and all partitions have shuffle outputs (i.e. the numAvailableOutputs is exactly the numPartitions).
isAvailable is used when:
DAGScheduler is requested to getMissingParentStages, handleMapStageSubmitted, submitMissingTasks, processShuffleMapStageCompletion, markMapStageJobsAsFinished and stageDependsOn
A ShuffleMapStage can be shared across multiple jobs (if these jobs reuse the same RDDs).
val keyValuePairs = sc.parallelize(0 to 5).map((_, 1))
val rdd = keyValuePairs.sortByKey() // (1)

scala> println(rdd.toDebugString)
(6) ShuffledRDD[4] at sortByKey at <console>:39 []
 +-(16) MapPartitionsRDD[1] at map at <console>:39 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:39 []

rdd.count // (2)
rdd.count // (3)
Shuffle at sortByKey()
Submits a job with two stages (and two to be executed)
Intentionally repeat the last action that submits a new job with two stages with one being shared as already-computed
ShuffleMapTask is a Task to produce a MapStatus (Task[MapStatus]).
ShuffleMapTask is one of the two types of Tasks. When executed, ShuffleMapTask writes the result of executing a serialized task code over the records (of a RDD partition) to the shuffle system and returns a MapStatus (with the BlockManager and estimated size of the result shuffle blocks).
ShuffleMapTask tracks TaskLocations as unique entries in the given locs (with the only rule that when locs is not defined, it is empty, and no task location preferences are defined).
ShuffleMapTask initializes the preferredLocs internal property when created
runTask writes the result (records) of executing the serialized task code over the records (in the RDD partition) to the shuffle system and returns a MapStatus (with the BlockManager and an estimated size of the result shuffle blocks).
Internally, runTask requests the SparkEnv for the new instance of closure serializer and requests it to deserialize the serialized task code (into a tuple of a RDD and a ShuffleDependency).
runTask measures the thread and CPU deserialization times.
runTask requests the SparkEnv for the ShuffleManager and requests it for a ShuffleWriter (for the ShuffleHandle and the partition).
runTask then requests the RDD for the records (of the partition) that the ShuffleWriter is requested to write out (to the shuffle system).
In the end, runTask requests the ShuffleWriter to stop (with the success flag on) and returns the shuffle map output status.
Note
This is the moment in Task's lifecycle (and its corresponding RDD) when a RDD partition is computed and in turn becomes a sequence of records (i.e. real data) on an executor.
In case of any exceptions, runTask requests the ShuffleWriter to stop (with the success flag off) and (re)throws the exception.
runTask may also print out a DEBUG message to the logs when the ShuffleWriter could not be stopped.
Number of Partitions (of the RDD of the ShuffleDependency of a ShuffleMapStage)
ShuffleStatus is created when:
MapOutputTrackerMaster is requested to register a shuffle (when DAGScheduler is requested to create a ShuffleMapStage)

=== MapStatuses per Partition
ShuffleStatus creates a mapStatuses internal registry of MapStatuses per partition (using the numPartitions) when created.
A missing partition is when there is no MapStatus for a partition (null at the index of the partition ID) and can be requested using findMissingPartitions.
mapStatuses is all null (for every partition) initially (and so all partitions are missing / uncomputed).
A new MapStatus is added in addMapOutput and updateMapOutput.
A MapStatus is removed (nulled) in removeMapOutput and removeOutputsByFilter.
The number of available MapStatuses is tracked by _numAvailableMapOutputs internal counter.
addMapOutput adds the MapStatus to the mapStatuses internal registry.
In case the mapStatuses internal registry had no MapStatus for the mapIndex already available, addMapOutput increments the _numAvailableMapOutputs internal counter and invalidateSerializedMapOutputStatusCache.
addMapOutput is used when:
MapOutputTrackerMaster is requested to registerMapOutput
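A simplified sketch of the registry: a fixed-size array in which null marks a missing (uncomputed) partition (MapStatusSketch is a stand-in, not Spark's MapStatus):

```scala
final case class MapStatusSketch(location: String, estimatedSize: Long)

final class ShuffleStatusSketch(numPartitions: Int) {
  private val mapStatuses = new Array[MapStatusSketch](numPartitions)   // null = missing
  private var numAvailableOutputs = 0

  def addMapOutput(mapIndex: Int, status: MapStatusSketch): Unit = {
    if (mapStatuses(mapIndex) == null) numAvailableOutputs += 1
    mapStatuses(mapIndex) = status
  }

  def removeMapOutput(mapIndex: Int): Unit =
    if (mapStatuses(mapIndex) != null) {
      mapStatuses(mapIndex) = null
      numAvailableOutputs -= 1
    }

  def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filter(i => mapStatuses(i) == null)
}
```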
Stage is an abstraction of steps in a physical execution plan.
Note
The logical DAG or logical execution plan is the RDD lineage.
Indirectly, a Stage is a set of parallel tasks - one task per partition (of an RDD that computes partial results of a function executed as part of a Spark job).
In other words, a Spark job is a computation \"sliced\" (not to use the reserved term partitioned) into stages.
DAGScheduler is requested to submit missing tasks of a stage
Abstract Class
Task is an abstract class and cannot be created directly. It is created indirectly as one of the two concrete Tasks (ShuffleMapTask and ResultTask).

=== isBarrier Flag
Task can be given isBarrier flag when created. Unless given, isBarrier is assumed disabled (false).
isBarrier flag indicates whether this Task belongs to a Barrier Stage in Barrier Execution Mode.
isBarrier flag is used when:
DAGScheduler is requested to handleTaskCompletion (of a FetchFailed task) to fail the parent stage (and retry a barrier stage when one of the barrier tasks fails)
Task is requested to run (to create a BarrierTaskContext)
TaskSetManager is requested to isBarrier and handleFailedTask
run registers the task (attempt) with the BlockManager.
run creates a TaskContextImpl (and perhaps a BarrierTaskContext too when the given isBarrier flag is enabled) that in turn becomes the task's TaskContext.
run checks _killed flag and, if enabled, kills the task (with interruptThread flag disabled).
run creates a Hadoop CallerContext and sets it.
run informs the given PluginContainer that the task is started.
run runs the task.
Note
This is the moment when the custom Task's runTask is executed.
In the end, run notifies TaskContextImpl that the task has completed (regardless of the final outcome -- a success or a failure).
In case of any exceptions, run notifies TaskContextImpl that the task has failed. run requests MemoryStore to release unroll memory for this task (for both ON_HEAP and OFF_HEAP memory modes).
Note
run uses SparkEnv to access the current BlockManager that it uses to access MemoryStore.
run requests MemoryManager to notify any tasks waiting for execution memory to be freed to wake up and try to acquire memory again.
run unsets the task's TaskContext.
Note
run uses SparkEnv to access the current MemoryManager.
run is used when:
TaskRunner is requested to run (when Executor is requested to launch a task (on "Executor task launch worker" thread pool sometime in the future))

=== Task States
Task can be in one of the following states (as described by TaskState enumeration):
LAUNCHING
RUNNING when the task is being started.
FINISHED when the task finished with the serialized result.
FAILED when the task fails, e.g. when FetchFailedException, CommitDeniedException or any Throwable occurs
KILLED when an executor kills a task.
LOST
States are the values of org.apache.spark.TaskState.
Note
Task status updates are sent from executors to the driver through ExecutorBackend.
Task is finished when it is in one of FINISHED, FAILED, KILLED, LOST.
LOST and FAILED states are considered failures.

=== Collecting Latest Values of Accumulators
collectAccumulatorUpdates collects the latest values of internal and external accumulators from a task (and returns the values as a collection of AccumulableInfo).
TaskInfo is information about a running task attempt inside a scheduler:TaskSet.md[TaskSet].
TaskInfo is created when:
scheduler:TaskSetManager.md#resourceOffer[TaskSetManager dequeues a task for execution (given resource offer)] (and records the task as running)
TaskUIData does dropInternalAndSQLAccumulables
JsonProtocol utility is used to spark-history-server:JsonProtocol.md#taskInfoFromJson[re-create a task details from JSON]
NOTE: Back then, at the commit 63051dd2bcc4bf09d413ff7cf89a37967edc33ba, when TaskInfo was first merged to Apache Spark on 07/06/12, TaskInfo was part of spark.scheduler.mesos package -- note \"Mesos\" in the name of the package that shows how much Spark and Mesos influenced each other at that time.
[[internal-registries]] TaskInfo's internal registries and counters:
[[finishTime]] finishTime: the time when the TaskInfo was marked as finished (markFinished).
[[index]] Index of the task within its scheduler:TaskSet.md[TaskSet] that may not necessarily be the same as the ID of the RDD partition that the task is computing.
[[attemptNumber]] Task attempt ID
[[launchTime]] Time when the task was dequeued for execution
[[executorId]] Executor that has been offered (as a resource) to run the task
[[host]] Host of the executor
[[taskLocality]] scheduler:TaskSchedulerImpl.md#TaskLocality[TaskLocality], i.e. locality preference of the task
[[speculative]] Flag whether a task is speculative or not
TaskInfo initializes the internal registries and counters when created.
=== [[markFinished]] Marking Task As Finished (Successfully or Not) -- markFinished Method

markFinished(state: TaskState, time: Long = System.currentTimeMillis): Unit
markFinished records the input time as finishTime.
markFinished marks the TaskInfo as failed when the input state is FAILED, or as killed when the state is KILLED.
NOTE: markFinished is used when TaskSetManager is notified that a task has finished scheduler:TaskSetManager.md#handleSuccessfulTask[successfully] or scheduler:TaskSetManager.md#handleFailedTask[failed].
TaskLocation represents a placement preference of an RDD partition, i.e. a hint of the location to submit scheduler:Task.md[tasks] for execution.
TaskLocations are tracked by scheduler:DAGScheduler.md#cacheLocs[DAGScheduler] for scheduler:DAGScheduler.md#submitMissingTasks[submitting missing tasks of a stage].
TaskLocation is available as scheduler:Task.md#preferredLocations[preferredLocations] of a task.
[[host]] Every TaskLocation describes the location by host name, but could also use other location-related metadata.
TaskLocations of an RDD and a partition is available using SparkContext.md#getPreferredLocs[SparkContext.getPreferredLocs] method.
Sealed
TaskLocation is a Scala private[spark] sealed trait so all the available implementations of TaskLocation trait are in a single Scala file.
ExecutorCacheTaskLocation describes a host and an executor.
ExecutorCacheTaskLocation informs the Scheduler to prefer a given executor, but the next level of preference is any executor on the same host if this is not possible.
HDFSCacheTaskLocation describes a host with data cached by HDFS.
Used exclusively when rdd:HadoopRDD.md#getPreferredLocations[HadoopRDD] and rdd:NewHadoopRDD.md#getPreferredLocations[NewHadoopRDD] are requested for their placement preferences (aka preferred locations).
== [[HostTaskLocation]] HostTaskLocation
HostTaskLocation describes a host only.

== TaskResult
TaskResult is an abstraction of task results (of type T).
The decision what TaskResult type to use is made when TaskRunner finishes running a task.
Sealed Trait
TaskResult is a Scala sealed trait which means that all of the implementations are in the same compilation unit (a single file).
TaskResultGetter is a helper class of scheduler:TaskSchedulerImpl.md#statusUpdate[TaskSchedulerImpl] for asynchronous deserialization of task results of tasks that have finished successfully (possibly fetching remote blocks) or the failure reasons of tasks that failed.
CAUTION: FIXME Image with the dependencies
TIP: Consult scheduler:Task.md#states[Task States] in Tasks to learn about the different task states.
NOTE: The only instance of TaskResultGetter is created while scheduler:TaskSchedulerImpl.md#creating-instance[TaskSchedulerImpl is created].
TaskResultGetter requires a core:SparkEnv.md[SparkEnv] and scheduler:TaskSchedulerImpl.md[TaskSchedulerImpl] to be created and is stopped when scheduler:TaskSchedulerImpl.md#stop[TaskSchedulerImpl stops].
TaskResultGetter uses the task-result-getter asynchronous task executor for operation.
TIP: Enable DEBUG logging level for org.apache.spark.scheduler.TaskResultGetter logger to see what happens inside.
getTaskResultExecutor creates a daemon thread pool with spark.resultGetter.threads threads and the task-result-getter thread name prefix.
TIP: Read up on https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html[java.util.concurrent.ThreadPoolExecutor] that getTaskResultExecutor uses under the covers.
serializer is a thread-local serializer:SerializerInstance.md[SerializerInstance] that TaskResultGetter uses to deserialize byte buffers (with TaskResults or a TaskEndReason).
When created for a new thread, serializer is initialized with a new instance of Serializer (using core:SparkEnv.md#closureSerializer[SparkEnv.closureSerializer]).
NOTE: TaskResultGetter uses https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html[java.lang.ThreadLocal] for the thread-local SerializerInstance variable.
taskResultSerializer is a thread-local serializer:SerializerInstance.md[SerializerInstance] that TaskResultGetter uses to...
When created for a new thread, taskResultSerializer is initialized with a new instance of Serializer (using core:SparkEnv.md#serializer[SparkEnv.serializer]).
NOTE: TaskResultGetter uses https://docs.oracle.com/javase/8/docs/api/java/lang/ThreadLocal.html[java.lang.ThreadLocal] for the thread-local SerializerInstance variable.
enqueueSuccessfulTask submits an asynchronous task (to the task-result-getter asynchronous task executor) that first deserializes serializedData to a DirectTaskResult, then updates the internal accumulator (with the size of the DirectTaskResult) and ultimately notifies the TaskSchedulerImpl that the tid task was completed and scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[the task result was received successfully] or scheduler:TaskSchedulerImpl.md#handleFailedTask[not].
NOTE: enqueueSuccessfulTask is just the asynchronous task enqueued for execution by the task-result-getter asynchronous task executor at some point in the future.
Internally, the enqueued task first deserializes serializedData to a TaskResult (using the internal thread-local serializer).
For a DirectTaskResult, the task scheduler:TaskSetManager.md#canFetchMoreResults[checks the available memory for the task result] and, when the size overflows configuration-properties.md#spark.driver.maxResultSize[spark.driver.maxResultSize], it simply returns.
Note
enqueueSuccessfulTask merely submits an asynchronous task, so returning from it simply ends the processing. That is why the quota check itself aborts the TaskSet when there is not enough memory, and the enqueued task just returns.
Otherwise, when there is enough memory to hold the task result, it deserializes the DirectTaskResult (using the internal thread-local <>).
For an IndirectTaskResult, the task checks the available memory for the task result and, when the size could overflow the maximum result size, it storage:BlockManagerMaster.md#removeBlock[removes the block] and simply returns.
Otherwise, when there is enough memory to hold the task result, you should see the following DEBUG message in the logs:
Fetching indirect task result for TID [tid]
The task scheduler:TaskSchedulerImpl.md#handleTaskGettingResult[notifies TaskSchedulerImpl that it is about to fetch a remote block for a task result]. It then storage:BlockManager.md#getRemoteBytes[gets the block from remote block managers (as serialized bytes)].
When the block could not be fetched, scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed] (with TaskResultLost task failure reason) and the task simply returns.
NOTE: enqueueSuccessfulTask merely submits an asynchronous task, so returning from it simply ends the processing; the real handling happens when scheduler:TaskSchedulerImpl.md#handleFailedTask[TaskSchedulerImpl is informed].
The task result (as a serialized byte buffer) is then deserialized to a DirectTaskResult (using the internal thread-local <>) and deserialized again using the internal thread-local <> (just like for the DirectTaskResult case). In the end, the storage:BlockManagerMaster.md#removeBlock[block is removed from BlockManagerMaster].
Note
An IndirectTaskResult is deserialized twice to become the final deserialized task result (using <> for a DirectTaskResult). Compare it to a DirectTaskResult task result that is deserialized once only.
With no exceptions thrown, enqueueSuccessfulTask scheduler:TaskSchedulerImpl.md#handleSuccessfulTask[informs the TaskSchedulerImpl that the tid task was completed and the task result was received].
A ClassNotFoundException leads to scheduler:TaskSetManager.md#abort[aborting the TaskSet] (with ClassNotFound with classloader: [loader] error message) while any non-fatal exception shows the following ERROR message in the logs followed by scheduler:TaskSetManager.md#abort[aborting the TaskSet].
Exception while getting task result
enqueueSuccessfulTask is used when TaskSchedulerImpl is requested to handle task status update (and the task has finished successfully).
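The overall decision flow can be summarized with the following simplified sketch. The Direct and Indirect stand-ins below are hypothetical, much simpler than Spark's private DirectTaskResult and IndirectTaskResult classes, and fetchAndDeserialize stands for the remote block fetch plus the second deserialization described above.

```scala
// Simplified stand-ins for Spark's (private) task result types.
sealed trait SketchTaskResult[T]
final case class Direct[T](value: T) extends SketchTaskResult[T]
final case class Indirect[T](blockId: String, size: Long) extends SketchTaskResult[T]

// maxResultSize stands for spark.driver.maxResultSize.
def handle[T](result: SketchTaskResult[T], maxResultSize: Long)(
    fetchAndDeserialize: String => T): Option[T] = result match {
  case Direct(value) =>
    Some(value)                        // deserialized once, used directly
  case Indirect(_, size) if size > maxResultSize =>
    None                               // too big: drop the block and give up
  case Indirect(blockId, _) =>
    Some(fetchAndDeserialize(blockId)) // fetch the remote block, deserialize again
}
```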
=== [[enqueueFailedTask]] Deserializing TaskFailedReason and Notifying TaskSchedulerImpl -- enqueueFailedTask Method
enqueueFailedTask(
  taskSetManager: TaskSetManager,
  tid: Long,
  taskState: TaskState.TaskState,
  serializedData: ByteBuffer): Unit
enqueueFailedTask submits an asynchronous task (to the <<task-result-getter asynchronous task executor>>) that first attempts to deserialize a TaskFailedReason from serializedData (using the internal thread-local <>) and then scheduler:TaskSchedulerImpl.md#handleFailedTask[notifies TaskSchedulerImpl that the task has failed].
Any ClassNotFoundException leads to the following ERROR message in the logs (without breaking the flow of enqueueFailedTask):
ERROR Could not deserialize TaskEndReason: ClassNotFound with classloader [loader]
NOTE: enqueueFailedTask is called when scheduler:TaskSchedulerImpl.md#statusUpdate[TaskSchedulerImpl is notified about a task that has failed (and is in FAILED, KILLED or LOST state)].
=== [[settings]] Settings
.Spark Properties
[cols="1,1,2",options="header",width="100%"]
|===
| Spark Property | Default Value | Description

| [[spark_resultGetter_threads]] spark.resultGetter.threads
| 4
| The number of threads for TaskResultGetter.
|===
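If more deserialization threads are needed, the property can be set on the SparkConf of the Spark application. Note that spark.resultGetter.threads is an internal setting, so treat the snippet as illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("result-getter-demo")
  .set("spark.resultGetter.threads", "8") // default: 4
```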
== [[TaskScheduler]] TaskScheduler

TaskScheduler is an abstraction of <> that can <> in a Spark application (per <>).
NOTE: TaskScheduler works closely with scheduler:DAGScheduler.md[DAGScheduler] that <> (for every stage in a Spark job).
TaskScheduler can track the executors available in a Spark application using <> and <> interceptors (that inform about active and lost executors, respectively).
Returns true when the execId executor is managed by the TaskScheduler. false indicates that the executor:Executor.md#reportHeartBeat[block manager (on the executor) should re-register].
Used when HeartbeatReceiver RPC endpoint is requested to handle a Heartbeat (with task metrics) from an executor
TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#initialize[initialize]
SparkContext is requested to SparkContext.md#getAllPools[getAllPools] and SparkContext.md#getPoolForName[getPoolForName]
TaskSchedulerImpl is requested to scheduler:TaskSchedulerImpl.md#resourceOffers[resourceOffers], scheduler:TaskSchedulerImpl.md#checkSpeculatableTasks[checkSpeculatableTasks], and scheduler:TaskSchedulerImpl.md#removeExecutor[removeExecutor]
A TaskScheduler is created while SparkContext is being created (by calling SparkContext.createTaskScheduler for a given master URL and deploy mode).
At this point in SparkContext's lifecycle, the internal _taskScheduler points at the TaskScheduler (and it is \"announced\" by sending a blocking TaskSchedulerIsSet message to HeartbeatReceiver RPC endpoint).
The <> right after the blocking TaskSchedulerIsSet message receives a response.
The <> and the <> are set at this point (and SparkContext uses the application id to set SparkConf.md#spark.app.id[spark.app.id] Spark property, and configure webui:spark-webui-SparkUI.md[SparkUI], and storage:BlockManager.md[BlockManager]).
CAUTION: FIXME The application id is described as \"associated with the job.\" in TaskScheduler, but I think it is \"associated with the application\" and you can have many jobs per application.
Right before SparkContext is fully initialized, <> is called.
The internal _taskScheduler is cleared (i.e. set to null) while SparkContext.md#stop[SparkContext is being stopped].
<> while scheduler:DAGScheduler.md#stop[DAGScheduler is being stopped].
WARNING: FIXME If it is SparkContext to start a TaskScheduler, shouldn't SparkContext stop it too? Why is this the way it is now?
== [[TaskSchedulerImpl]] TaskSchedulerImpl

TaskSchedulerImpl is a TaskScheduler that uses a SchedulerBackend to schedule tasks (for execution on a cluster manager).
When a Spark application starts (and so an instance of SparkContext is created), TaskSchedulerImpl is created together with a SchedulerBackend and a DAGScheduler, and all of them are started soon after.
TaskSchedulerImpl generates tasks based on executor resource offers.
TaskSchedulerImpl can track racks per host and port (that however is only used with Hadoop YARN cluster manager).
Using spark.scheduler.mode configuration property you can select the scheduling policy.
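For example, the FAIR scheduling policy could be selected as follows:

```scala
import org.apache.spark.SparkConf

// FIFO is the default scheduling mode; FAIR enables fair scheduling across pools.
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
```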
TaskSchedulerImpl submits tasks using SchedulableBuilders.
TaskSchedulerImpl is given a SchedulerBackend when requested to initialize.
The lifecycle of the SchedulerBackend is tightly coupled to the lifecycle of the TaskSchedulerImpl:
It is started when TaskSchedulerImpl is started
It is stopped when TaskSchedulerImpl is stopped
TaskSchedulerImpl waits until the SchedulerBackend is ready before requesting it for the following:
Reviving resource offers when requested to submitTasks, statusUpdate, handleFailedTask, checkSpeculatableTasks, and executorLost
Killing tasks when requested to killTaskAttempt and killAllTaskAttempts
Default parallelism, applicationId and applicationAttemptId when requested for the defaultParallelism, applicationId and applicationAttemptId, respectively
","text":""},{"location":"scheduler/TaskSchedulerImpl/#unique-identifier-of-spark-application","title":"Unique Identifier of Spark Application
applicationId(): String\n
applicationId is part of the TaskScheduler abstraction.
applicationId simply requests the SchedulerBackend for the applicationId.
","text":""},{"location":"scheduler/TaskSchedulerImpl/#cancelling-all-tasks-of-stage","title":"Cancelling All Tasks of Stage
statusUpdate finds TaskSetManager for the input tid task (in <>).
When state is LOST, statusUpdate...FIXME
NOTE: TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode.
When state is one of the scheduler:Task.md#states[finished states], i.e. FINISHED, FAILED, KILLED or LOST, statusUpdate <> for the input tid.
statusUpdate scheduler:TaskSetManager.md#removeRunningTask[requests TaskSetManager to unregister tid from running tasks].
statusUpdate requests <> to scheduler:TaskResultGetter.md#enqueueSuccessfulTask[schedule an asynchronous task to deserialize the task result (and notify TaskSchedulerImpl back)] for tid in FINISHED state and scheduler:TaskResultGetter.md#enqueueFailedTask[schedule an asynchronous task to deserialize TaskFailedReason (and notify TaskSchedulerImpl back)] for tid in the other finished states (i.e. FAILED, KILLED, LOST).
If a task is in LOST state, statusUpdate scheduler:DAGScheduler.md#executorLost[notifies DAGScheduler that the executor was lost] (with SlaveLost and the reason Task [tid] was lost, so marking the executor as lost as well.) and scheduler:SchedulerBackend.md#reviveOffers[requests SchedulerBackend to revive offers].
In case the TaskSetManager for tid could not be found (in <> registry), you should see the following ERROR message in the logs:
Ignoring update with state [state] for TID [tid] because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
Any exception is caught and reported as ERROR message in the logs:
Exception in statusUpdate
CAUTION: FIXME image with scheduler backends calling TaskSchedulerImpl.statusUpdate.
statusUpdate is used when:
DriverEndpoint (of CoarseGrainedSchedulerBackend) is requested to handle a StatusUpdate message
LocalEndpoint is requested to handle a StatusUpdate message
","text":""},{"location":"scheduler/TaskSchedulerImpl/#task-scheduler-speculation-scheduled-executor-service","title":"task-scheduler-speculation Scheduled Executor Service
speculationScheduler is a java.util.concurrent.ScheduledExecutorService with the name task-scheduler-speculation for Speculative Execution of Tasks.
When TaskSchedulerImpl is requested to start (in non-local run mode) with spark.speculation enabled, speculationScheduler is used to schedule checkSpeculatableTasks to execute periodically every spark.speculation.interval.
speculationScheduler is shut down when TaskSchedulerImpl is requested to stop.
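Enabling the feature boils down to the following two properties (the interval is shown with its default value):

```scala
import org.apache.spark.SparkConf

// Speculative execution: checkSpeculatableTasks runs on the
// task-scheduler-speculation executor service every spark.speculation.interval.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")
```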
","text":""},{"location":"scheduler/TaskSchedulerImpl/#checking-for-speculatable-tasks","title":"Checking for Speculatable Tasks
checkSpeculatableTasks(): Unit\n
checkSpeculatableTasks requests rootPool to check for speculatable tasks (if they ran for more than 100 ms) and, if there are any, requests scheduler:SchedulerBackend.md#reviveOffers[SchedulerBackend to revive offers].
NOTE: checkSpeculatableTasks is executed periodically as part of speculative-execution-of-tasks.md[].
","text":""},{"location":"scheduler/TaskSchedulerImpl/#cleaning-up-after-removing-executor","title":"Cleaning up After Removing Executor
removeExecutor removes the executorId executor from the following <>: <>, executorIdToHost, executorsByHost, and hostsByRack. If the affected hosts and racks are the last entries in executorsByHost and hostsByRack, respectively, they are removed from the registries.
Unless reason is LossReasonPending, the executor is removed from executorIdToHost registry and Schedulable.md#executorLost[TaskSetManagers get notified].
NOTE: The internal removeExecutor is called as part of <> and scheduler:TaskScheduler.md#executorLost[executorLost].

=== Handling Nearly-Completed SparkContext Initialization

postStartHook(): Unit
postStartHook is part of the TaskScheduler abstraction.
postStartHook waits until a scheduler backend is ready.
","text":""},{"location":"scheduler/TaskSchedulerImpl/#waiting-until-schedulerbackend-is-ready","title":"Waiting Until SchedulerBackend is Ready
waitBackendReady(): Unit\n
waitBackendReady waits until the SchedulerBackend is ready. If it is, waitBackendReady returns immediately. Otherwise, waitBackendReady keeps checking every 100 milliseconds (hardcoded) until the SchedulerBackend is ready or the <> is SparkContext.md#stopped[stopped].
Note
A SchedulerBackend is ready by default.
If the SparkContext happens to be stopped while waiting, waitBackendReady throws an IllegalStateException.
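The waiting loop can be sketched as follows. The function parameters and the exception message are illustrative, not copied from the Spark sources:

```scala
// Poll every 100 ms until the backend reports readiness or the application stops.
def waitBackendReady(isReady: () => Boolean, isStopped: () => Boolean): Unit = {
  while (!isReady()) {
    if (isStopped()) {
      throw new IllegalStateException("SparkContext stopped while waiting for backend")
    }
    Thread.sleep(100)
  }
}
```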
stop stops all the internal services, i.e. the <<task-scheduler-speculation executor service>>, scheduler:SchedulerBackend.md[SchedulerBackend], scheduler:TaskResultGetter.md[TaskResultGetter], and <> timer.

=== Default Level of Parallelism

defaultParallelism(): Int
defaultParallelism is part of the TaskScheduler abstraction.
defaultParallelism requests the SchedulerBackend for the default level of parallelism.
Note
Default level of parallelism is a hint for sizing jobs that SparkContext uses to create RDDs with the right number of partitions unless specified explicitly.
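For example, in spark-shell (where sc is the active SparkContext), the value surfaces as sc.defaultParallelism and drives the number of partitions when it is not given explicitly:

```scala
// Assumes an active SparkContext available as sc (e.g. in spark-shell).
val parallelism = sc.defaultParallelism
val rdd = sc.parallelize(1 to 100) // partitioned into sc.defaultParallelism slices
```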
","text":""},{"location":"scheduler/TaskSchedulerImpl/#submitting-tasks-of-taskset-for-execution","title":"Submitting Tasks (of TaskSet) for Execution
submitTasks(\n taskSet: TaskSet): Unit\n
submitTasks is part of the TaskScheduler abstraction.
In essence, submitTasks registers a new TaskSetManager (for the given TaskSet) and requests the SchedulerBackend to handle resource allocation offers (from the scheduling system).
Internally, submitTasks prints out the following INFO message to the logs:
Adding task set [id] with [length] tasks
submitTasks then <> (for the given TaskSet.md[TaskSet] and the <>).
submitTasks registers (adds) the TaskSetManager per TaskSet.md#stageId[stage] and TaskSet.md#stageAttemptId[stage attempt] IDs (of the TaskSet.md[TaskSet]) in the <> internal registry.
NOTE: <> internal registry tracks the TaskSetManager.md[TaskSetManagers] (that represent TaskSet.md[TaskSets]) per stage and stage attempts. In other words, there could be many TaskSetManagers for a single stage, each representing a unique stage attempt.
NOTE: Not only could a task be retried (cf. <>), but also a single stage.
submitTasks makes sure that there is exactly one active TaskSetManager (with different TaskSet) across all the managers (for the stage). Otherwise, submitTasks throws an IllegalStateException:
more than one active taskSet for stage [stage]: [TaskSet ids]
NOTE: TaskSetManager is considered active when it is not a zombie.
submitTasks requests the <> to SchedulableBuilder.md#addTaskSetManager[add the TaskSetManager to the schedulable pool].
NOTE: The TaskScheduler.md#rootPool[schedulable pool] can be a single flat linked queue (in FIFOSchedulableBuilder.md[FIFO scheduling mode]) or a hierarchy of pools of Schedulables (in FairSchedulableBuilder.md[FAIR scheduling mode]).
submitTasks <> to make sure that the requested resources (i.e. CPU and memory) are assigned to the Spark application for a <> (the very first time the Spark application is started per <> flag).
NOTE: The very first time (<> flag is false) in cluster mode only (i.e. isLocal of the TaskSchedulerImpl is false), starvationTimer is scheduled to execute after configuration-properties.md#spark.starvation.timeout[spark.starvation.timeout] to ensure that the requested resources, i.e. CPUs and memory, were assigned by a cluster manager.
NOTE: After the first configuration-properties.md#spark.starvation.timeout[spark.starvation.timeout] passes, the <> internal flag is true.
In the end, submitTasks requests the <> to scheduler:SchedulerBackend.md#reviveOffers[reviveOffers].
TIP: Use dag-scheduler-event-loop thread to step through the code in a debugger.
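The registry bookkeeping and the sanity check of submitTasks can be pictured with the following simplified sketch; the stub type and helper are hypothetical stand-ins for Spark's internal classes.

```scala
import scala.collection.mutable

// Stage id -> (stage attempt id -> manager); a stand-in for taskSetsByStageIdAndAttempt.
final case class TaskSetManagerStub(stageId: Int, stageAttemptId: Int, isZombie: Boolean = false)

val taskSetsByStageIdAndAttempt =
  mutable.HashMap.empty[Int, mutable.HashMap[Int, TaskSetManagerStub]]

def register(manager: TaskSetManagerStub): Unit = {
  val byAttempt =
    taskSetsByStageIdAndAttempt.getOrElseUpdate(manager.stageId, mutable.HashMap.empty)
  byAttempt(manager.stageAttemptId) = manager
  // Mirror of the sanity check: at most one active (non-zombie) manager per stage.
  val active = byAttempt.values.filterNot(_.isZombie)
  require(active.size <= 1, s"more than one active taskSet for stage ${manager.stageId}")
}
```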
handleFailedTask scheduler:TaskSetManager.md#handleFailedTask[notifies taskSetManager that tid task has failed] and, only when scheduler:TaskSetManager.md#zombie-state[taskSetManager is not in zombie state] and tid is not in KILLED state, scheduler:SchedulerBackend.md#reviveOffers[requests SchedulerBackend to revive offers].
NOTE: handleFailedTask is called when scheduler:TaskResultGetter.md#enqueueFailedTask[TaskResultGetter deserializes a TaskFailedReason] for a failed task.
taskSetFinished looks up all the scheduler:TaskSet.md[TaskSet]s for the stage id (in the <> registry) and removes the given stage attempt from them, possibly removing the entire stage record from the taskSetsByStageIdAndAttempt registry completely (if there are no other attempts registered).
taskSetFinished then removes manager from the parent's schedulable pool.
You should see the following INFO message in the logs:
Removed TaskSet [id], whose tasks have all completed, from pool [name]
taskSetFinished is used when:
TaskSetManager is requested to maybeFinishTaskSet
","text":""},{"location":"scheduler/TaskSchedulerImpl/#notifying-dagscheduler-about-new-executor","title":"Notifying DAGScheduler About New Executor
executorAdded just DAGScheduler.md#executorAdded[notifies DAGScheduler that an executor was added].
NOTE: executorAdded uses <> that was given when <>.

=== [[resourceOffers]] Creating TaskDescriptions For Available Executor Resource Offers
resourceOffers takes the resources offers and generates a collection of tasks (as TaskDescriptions) to launch (given the resources available).
Note
A WorkerOffer represents a resource offer with CPU cores free to use on an executor.
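In its simplest form, a resource offer can be modeled as below. Spark's actual WorkerOffer carries more fields in recent versions (for example the executor address and custom resources), so this is only a minimal sketch:

```scala
// A minimal model of a resource offer: an executor, its host and free CPU cores.
final case class WorkerOffer(executorId: String, host: String, cores: Int)

val offers = Seq(
  WorkerOffer("exec-1", "host-a", cores = 4),
  WorkerOffer("exec-2", "host-b", cores = 4))
```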
Internally, resourceOffers first updates <> and <> lookup tables to record new hosts and executors (given the input offers).
For new executors (not in <>) resourceOffers <<notifies DAGScheduler that an executor was added>>.
NOTE: TaskSchedulerImpl uses resourceOffers to track active executors.
CAUTION: FIXME a picture with executorAdded call from TaskSchedulerImpl to DAGScheduler.
resourceOffers requests BlacklistTracker to applyBlacklistTimeout and filters out offers on blacklisted nodes and executors.
NOTE: resourceOffers uses the optional <> that was given when <>.
CAUTION: FIXME Expand on blacklisting
resourceOffers then randomly shuffles offers (to evenly distribute tasks across executors and avoid over-utilizing some executors) and initializes the local data structures tasks and availableCpus (as shown in the figure below).
resourceOffers Pool.md#getSortedTaskSetQueue[takes TaskSets in scheduling order] from scheduler:TaskScheduler.md#rootPool[top-level Schedulable Pool].
Note
rootPool is configured when <>.
rootPool is part of the scheduler:TaskScheduler.md#rootPool[TaskScheduler Contract] and exclusively managed by scheduler:SchedulableBuilder.md[SchedulableBuilders], i.e. scheduler:FIFOSchedulableBuilder.md[FIFOSchedulableBuilder] and scheduler:FairSchedulableBuilder.md[FairSchedulableBuilder] (that scheduler:SchedulableBuilder.md#addTaskSetManager[manage registering TaskSetManagers with the root pool]).
scheduler:TaskSetManager.md[TaskSetManager] manages execution of the tasks in a single scheduler:TaskSet.md[TaskSet] that represents a single scheduler:Stage.md[Stage].
For every TaskSetManager (in scheduling order), you should see the following DEBUG message in the logs:
Only if a new executor was added, resourceOffers scheduler:TaskSetManager.md#executorAdded[notifies every TaskSetManager about the change] (to recompute locality preferences).
resourceOffers then takes every TaskSetManager (in scheduling order) and offers them each node in increasing order of locality levels (per scheduler:TaskSetManager.md#computeValidLocalityLevels[TaskSetManager's valid locality levels]).
NOTE: A TaskSetManager scheduler:TaskSetManager.md#computeValidLocalityLevels[computes locality levels of the tasks] it manages.
For every TaskSetManager and the TaskSetManager's valid locality level, resourceOffers tries to <> as long as the TaskSetManager manages to launch a task (given the locality level).
If resourceOffers did not manage to offer resources to a TaskSetManager so it could launch any task, resourceOffers scheduler:TaskSetManager.md#abortIfCompletelyBlacklisted[requests the TaskSetManager to abort the TaskSet if completely blacklisted].
When resourceOffers managed to launch a task, the internal <> flag gets enabled (that effectively means what the name says \"there were executors and I managed to launch a task\").
resourceOffers is used when:
CoarseGrainedSchedulerBackend (via DriverEndpoint RPC endpoint) is requested to make executor resource offers
LocalEndpoint is requested to revive resource offers
Unless a BarrierCoordinator has already been registered, maybeInitBarrierCoordinator creates a BarrierCoordinator and registers it to be known as barrierSync.
In the end, maybeInitBarrierCoordinator prints out the following INFO message to the logs:
Registered BarrierCoordinator endpoint
","text":""},{"location":"scheduler/TaskSchedulerImpl/#resourceOfferSingleTaskSet","title":"Finding Tasks from TaskSetManager to Schedule on Executors
resourceOfferSingleTaskSet takes every WorkerOffer (from the input shuffledOffers) and (only if the number of available CPU cores (using the input availableCpus) is at least configuration-properties.md#spark.task.cpus[spark.task.cpus]) scheduler:TaskSetManager.md#resourceOffer[requests TaskSetManager (as the input taskSet) to find a Task to execute (given the resource offer)] (as an executor, a host, and the input maxLocality).
resourceOfferSingleTaskSet adds the task to the input tasks collection.
resourceOfferSingleTaskSet records the task id and TaskSetManager in internal registries (such as taskIdToTaskSetManager and taskIdToExecutorId).
resourceOfferSingleTaskSet decreases configuration-properties.md#spark.task.cpus[spark.task.cpus] from the input availableCpus (for the WorkerOffer).
resourceOfferSingleTaskSet returns whether a task was launched or not.
Note
resourceOfferSingleTaskSet asserts that the number of available CPU cores (in the input availableCpus per WorkerOffer) is at least 0.
If there is a TaskNotSerializableException, resourceOfferSingleTaskSet prints out the following ERROR in the logs:
Resource offer failed, task set [name] was not serializable
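The per-offer CPU bookkeeping can be sketched as follows, reusing the simplified WorkerOffer model from the earlier sketch; tryLaunch stands for the call to TaskSetManager.resourceOffer and CPUS_PER_TASK for spark.task.cpus.

```scala
val CPUS_PER_TASK = 1 // stands for spark.task.cpus

def resourceOfferSingleTaskSet(
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int])(tryLaunch: WorkerOffer => Boolean): Boolean = {
  var launchedTask = false
  // Only offers with enough free CPU cores are considered.
  for (i <- shuffledOffers.indices if availableCpus(i) >= CPUS_PER_TASK) {
    if (tryLaunch(shuffledOffers(i))) {
      availableCpus(i) -= CPUS_PER_TASK
      assert(availableCpus(i) >= 0)
      launchedTask = true
    }
  }
  launchedTask
}
```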
== [[TaskSet]] TaskSet

TaskSet is a collection of independent tasks of a stage (and a stage execution attempt) that are missing (uncomputed), i.e. for which computation results are unavailable (as RDD blocks on BlockManagers on executors).
In other words, a TaskSet represents the missing partitions of a stage that (as tasks) can be run right away based on the data that is already on the cluster, e.g. map output files from previous stages, though they may fail if this data becomes unavailable.
Since the tasks are only the missing tasks, their number does not necessarily have to be the number of all the tasks of a stage. For a brand new stage (that has never been attempted to compute) their numbers are exactly the same.
Once DAGScheduler submits the missing tasks for execution (to the TaskScheduler), the execution of the TaskSet is managed by a TaskSetManager that allows for spark.task.maxFailures.
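For example, to tolerate more task failures before giving up on a stage:

```scala
import org.apache.spark.SparkConf

// Allow a single task to fail up to 8 times before the TaskSet is aborted
// (spark.task.maxFailures defaults to 4 when running on a cluster manager).
val conf = new SparkConf().set("spark.task.maxFailures", "8")
```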
The priority is the ID of the earliest-created active job that needs the stage (that is given when DAGScheduler is requested to submit the missing tasks of a stage).
Once submitted for execution, the priority is the priority of the TaskSetManager (which is a Schedulable) that is used for task prioritization (prioritizing scheduling of tasks) in the FIFO scheduling mode.
TaskSchedulerImpl is requested to create a TaskSetManager
== [[TaskSetManager]] TaskSetManager

While being created, TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks in the taskset.
Note
TaskSetManager uses TaskSchedulerImpl to access the current MapOutputTracker.
TaskSetManager prints out the following DEBUG message to the logs:
Epoch for [taskSet]: [epoch]
TaskSetManager adds the tasks as pending execution (in reverse order from the highest partition to the lowest).
"},{"location":"scheduler/TaskSetManager/#number-of-task-failures","title":"Number of Task Failures
TaskSetManager is given maxTaskFailures value that is how many times a single task can fail before the whole TaskSet is aborted.
[cols="1,1",options="header"]
|===
| Master URL | Number of Task Failures

| local | 1
| local-with-retries | maxFailures
| local-cluster | spark.task.maxFailures
| Cluster Manager | spark.task.maxFailures
|===

=== [[isBarrier]] isBarrier

isBarrier: Boolean
isBarrier is enabled (true) when this TaskSetManager is created for a TaskSet with barrier tasks.
isBarrier is used when:
TaskSchedulerImpl is requested to resourceOfferSingleTaskSet, resourceOffers
TaskSetManager is requested to resourceOffer, checkSpeculatableTasks, getLocalityWait
=== [[resourceOffer]] resourceOffer

resourceOffer determines the allowed locality level for the given TaskLocality (for any TaskLocality but NO_PREF).
resourceOffer then dequeues a task (dequeueTask) for the given execId and host, and the allowed locality level. This may or may not give a TaskDescription.
In the end, resourceOffer returns the TaskDescription, hasScheduleDelayReject, and the index of the dequeued task (if any).
resourceOffer returns a (None, false, -1) tuple when this TaskSetManager is isZombie or the offer (by the given host or execId) should be ignored (excluded).
resourceOffer is used when:
TaskSchedulerImpl is requested to resourceOfferSingleTaskSet
recomputeLocality recomputes myLocalityLevels, localityWaits and currentLocalityIndex internal registries.
recomputeLocality computes locality levels (for scheduled tasks) and saves the result in myLocalityLevels internal registry.
recomputeLocality computes localityWaits by determining the locality wait for every locality level in myLocalityLevels.
recomputeLocality computes currentLocalityIndex by getLocalityIndex with the previous locality level. If the current locality index is higher than the previous, recomputeLocality recalculates currentLocalityIndex.
recomputeLocality is used when:
TaskSetManager is notified about status change in executors (i.e., lost, decommissioned, added)
A TaskSetManager is a zombie when all tasks in a taskset have completed successfully (regardless of the number of task attempts), or if the taskset has been aborted.
While in zombie state, a TaskSetManager can launch no new tasks and responds with no TaskDescriptions to resourceOffers.
A TaskSetManager remains in the zombie state until all tasks have finished running, i.e. to continue to track and account for the running tasks.
Enable DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl (or org.apache.spark.scheduler.cluster.YarnScheduler for YARN) and org.apache.spark.scheduler.TaskSetManager and execute the following two-stage job to see their low-level inner workings.
A cluster manager is recommended since it gives more task localization choices (with YARN additionally supporting rack localization).
$ ./bin/spark-shell \
 --master yarn \
 --conf spark.ui.showConsoleProgress=false

// Keep # partitions low to keep # messages low

scala> sc.parallelize(0 to 9, 3).groupBy(_ % 3).count
INFO YarnScheduler: Adding task set 0.0 with 3 tasks
DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.0.2.87, executor 1, partition 0, PROCESS_LOCAL, 7541 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.0.2.87, executor 2, partition 1, PROCESS_LOCAL, 7541 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1
INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.0.2.87, executor 1, partition 2, PROCESS_LOCAL, 7598 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 1
DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 518 ms on 10.0.2.87 (executor 1) (1/3)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 512 ms on 10.0.2.87 (executor 2) (2/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_0.0, runningTasks: 0
INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51 ms on 10.0.2.87 (executor 1) (3/3)
INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO YarnScheduler: Adding task set 1.0 with 3 tasks
DEBUG TaskSetManager: Epoch for TaskSet 1.0: 1
DEBUG TaskSetManager: Valid locality levels for TaskSet 1.0: NODE_LOCAL, RACK_LOCAL, ANY
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 3, 10.0.2.87, executor 2, partition 0, NODE_LOCAL, 7348 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 4, 10.0.2.87, executor 1, partition 1, NODE_LOCAL, 7348 bytes)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1
INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 5, 10.0.2.87, executor 1, partition 2, NODE_LOCAL, 7348 bytes)
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 4) in 130 ms on 10.0.2.87 (executor 1) (1/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 1
DEBUG TaskSetManager: No tasks for locality level NODE_LOCAL, so moving to locality level RACK_LOCAL
DEBUG TaskSetManager: No tasks for locality level RACK_LOCAL, so moving to locality level ANY
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 3) in 133 ms on 10.0.2.87 (executor 2) (2/3)
DEBUG YarnScheduler: parentName: , name: TaskSet_1.0, runningTasks: 0
INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 5) in 21 ms on 10.0.2.87 (executor 1) (3/3)
INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
res0: Long = 3
supportsRelocationOfSerializedObjects is part of the Serializer abstraction.
supportsRelocationOfSerializedObjects creates a new SerializerInstance (that is assumed to be a KryoSerializerInstance) and requests it to get the value of the autoReset field.
== [[Serializer]] Serializer

Serializer is an abstraction of serializers for serialization and deserialization of tasks (closures) and data blocks in a Spark application.

=== Contract

=== Creating New SerializerInstance

newInstance(): SerializerInstance
Creates a new SerializerInstance
Used when:
Task is created (only used in tests)
SerializerSupport (Spark SQL) utility is used to newSerializer
RangePartitioner is requested to writeObject and readObject
TorrentBroadcast utility is used to blockifyObject and unBlockifyObject
TaskRunner is requested to run
NettyBlockRpcServer is requested to deserializeMetadata
NettyBlockTransferService is requested to uploadBlock
PairRDDFunctions is requested to...FIXME
ParallelCollectionPartition is requested to...FIXME
RDD is requested to...FIXME
ReliableCheckpointRDD utility is used to...FIXME
NettyRpcEnvFactory is requested to create a RpcEnv
getSerializer returns the KryoSerializer when the given ClassTags are Kryo-compatible and the autoPick flag is true. Otherwise, getSerializer returns the default Serializer.
autoPick flag is true for all BlockIds but Spark Streaming's StreamBlockIds.
getSerializer (with autoPick flag) is used when:
SerializerManager is requested to dataSerializeStream, dataSerializeWithExplicitClassTag and dataDeserializeStream
SerializedValuesHolder (of MemoryStore) is requested for a SerializationStream
getSerializer (with key and value ClassTags only) is used when:
== [[BaseShuffleHandle]] BaseShuffleHandle

BaseShuffleHandle is a ShuffleHandle that is used to capture the parameters when SortShuffleManager is requested for a ShuffleHandle (and the other specialized ShuffleHandles could not be selected):
// Start a Spark application, e.g. spark-shell, with the Spark properties to trigger selection of BaseShuffleHandle:
// 1. spark.shuffle.spill.numElementsForceSpillThreshold=1
// 2. spark.shuffle.sort.bypassMergeThreshold=1

// numSlices > spark.shuffle.sort.bypassMergeThreshold
scala> val rdd = sc.parallelize(0 to 4, numSlices = 2).groupBy(_ % 2)
rdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:24

scala> rdd.dependencies
DEBUG SortShuffleManager: Can't use serialized shuffle for shuffle 0 because an aggregator is defined
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@1160c54b)

scala> rdd.getNumPartitions
res1: Int = 2

scala> import org.apache.spark.ShuffleDependency
import org.apache.spark.ShuffleDependency

scala> val shuffleDep = rdd.dependencies(0).asInstanceOf[ShuffleDependency[Int, Int, Int]]
shuffleDep: org.apache.spark.ShuffleDependency[Int,Int,Int] = org.apache.spark.ShuffleDependency@1160c54b

// mapSideCombine is disabled
scala> shuffleDep.mapSideCombine
res2: Boolean = false

// aggregator defined
scala> shuffleDep.aggregator
res3: Option[org.apache.spark.Aggregator[Int,Int,Int]] = Some(Aggregator(<function1>,<function2>,<function2>))

// the number of reduce partitions < spark.shuffle.sort.bypassMergeThreshold
scala> shuffleDep.partitioner.numPartitions
res4: Int = 2

scala> shuffleDep.shuffleHandle
res5: org.apache.spark.shuffle.ShuffleHandle = org.apache.spark.shuffle.BaseShuffleHandle@22b0fe7e
","text":""},{"location":"shuffle/BlockStoreShuffleReader/#review-me","title":"Review Me
=== [[read]] Reading Combined Records For Reduce Task
Internally, read first storage:ShuffleBlockFetcherIterator.md#creating-instance[creates a ShuffleBlockFetcherIterator] (passing in the values of <>, <> and <> Spark properties).
NOTE: read uses scheduler:MapOutputTracker.md#getMapSizesByExecutorId[MapOutputTracker to find the BlockManagers with the shuffle blocks and sizes] to create ShuffleBlockFetcherIterator.
read creates a new serializer:SerializerInstance.md[SerializerInstance] (using Serializer from ShuffleDependency).
read creates a key/value iterator by deserializeStream every shuffle block stream.
read updates the context task metrics for each record read.
NOTE: read uses CompletionIterator (to count the records read) and spark-InterruptibleIterator.md[InterruptibleIterator] (to support task cancellation).
If the ShuffleDependency has an Aggregator defined, read wraps the current iterator inside an iterator defined by Aggregator.combineCombinersByKey (for mapSideCombine enabled) or Aggregator.combineValuesByKey otherwise.
NOTE: read reports an exception when the ShuffleDependency has no Aggregator defined but the mapSideCombine flag is enabled.
For keyOrdering defined in the ShuffleDependency, read does the following:
shuffle:ExternalSorter.md#creating-instance[Creates an ExternalSorter]
shuffle:ExternalSorter.md#insertAll[Inserts all the records] into the ExternalSorter
Updates context TaskMetrics
Returns a CompletionIterator for the ExternalSorter
BypassMergeSortShuffleHandle is a BaseShuffleHandle that SortShuffleManager uses when it can avoid merge-sorting data (when requested to register a shuffle).
BypassMergeSortShuffleHandle tells SortShuffleManager to use BypassMergeSortShuffleWriter when requested for a ShuffleWriter.