Zeppelin: Documentation of components & options
This page describes a number of features that are available with Zeppelin related to Authentication, Storage, and Git integration. Most of the content is extracted from the Zeppelin Docs page here: http://zeppelin.apache.org/docs/0.8.2/
There are also a number of tutorials describing how to implement features that are not available out of the box, and some of those are also mentioned here.
Apache Zeppelin has a pluggable notebook storage mechanism, controlled by the zeppelin.notebook.storage configuration option, with multiple implementations.
Some of the relevant implementations are listed here, for the full documentation see: http://zeppelin.apache.org/docs/0.8.2/setup/storage/storage.html
A few notebook storage systems are available out of the box:
- (default) use local file system and version it using local Git repository - GitNotebookRepo
- all notes are saved in the notebook folder in your local File System - VFSNotebookRepo
- all notes are saved in the notebook folder in hadoop compatible file system - FileSystemNotebookRepo
- storage using Amazon S3 service - S3NotebookRepo
- storage using Azure service - AzureNotebookRepo
- storage using Google Cloud Storage - GCSNotebookRepo
- storage using MongoDB - MongoNotebookRepo
- storage using GitHub - GitHubNotebookRepo
To enable versioning for all your local notebooks through a standard Git repository, uncomment the following property in zeppelin-site.xml to use the GitNotebookRepo class:
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
https://zeppelin.apache.org/docs/0.8.0/setup/storage/storage.html#notebook-storage-in-github
The Zeppelin documentation describes how to integrate Zeppelin with a GitHub repo, so that you can manage notebook versions.
GitHub actions that can be performed from the Zeppelin UI:
- Pushing Commits to GitHub
- Restore a Commit from GitHub
- Pull Request from ZeppelinUI
- Resolve Conflicts
To enable GitHub tracking, uncomment the following properties in zeppelin-site.xml:
<property>
<name>zeppelin.notebook.git.remote.url</name>
<value></value>
<description>remote Git repository URL</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.username</name>
<value>token</value>
<description>remote Git repository username</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.access-token</name>
<value></value>
<description>remote Git repository password</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.origin</name>
<value>origin</value>
<description>Git repository remote</description>
</property>
And set the zeppelin.notebook.storage property to org.apache.zeppelin.notebook.repo.GitHubNotebookRepo:
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitHubNotebookRepo</value>
</property>
The access token can be created on the GitHub token settings page: https://github.com/settings/tokens
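Since the GitHub setup depends on several properties being filled in, a quick sanity check of zeppelin-site.xml may help before restarting Zeppelin. The following is a minimal sketch, assuming the file uses a `<configuration>` root element and treating the four remote properties listed above as required. The function names are hypothetical and not part of Zeppelin.

```python
import xml.etree.ElementTree as ET

# Properties taken from the GitHub section above; treating all four as
# required is an assumption of this sketch.
GITHUB_PROPERTIES = [
    "zeppelin.notebook.git.remote.url",
    "zeppelin.notebook.git.remote.username",
    "zeppelin.notebook.git.remote.access-token",
    "zeppelin.notebook.git.remote.origin",
]


def read_properties(site_xml: str) -> dict:
    """Return a {name: value} mapping for every <property> in the XML text."""
    root = ET.fromstring(site_xml)
    properties = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            properties[name] = value
    return properties


def missing_github_properties(site_xml: str) -> list:
    """List the GitHub properties that are absent or have an empty value."""
    properties = read_properties(site_xml)
    return [name for name in GITHUB_PROPERTIES if not properties.get(name)]
```

Passing the contents of conf/zeppelin-site.xml to `missing_github_properties` returns an empty list when every property has a value.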
Note:
How does this work with multiple users? It seems that this configuration links a single Zeppelin instance to a single GitHub repo. For multiple users, running one Zeppelin instance per user may allow us to link GitHub repos to individual users.
Related Links:
https://docs.qubole.com/en/latest/user-guide/notebooks-and-dashboards/notebooks/zep-notebooks/managing-notebook-versions/link-notebook-github.html
https://community.cloudera.com/t5/Community-Articles/How-To-Store-Zeppelin-Notes-in-GitHub-repo/ta-p/247398
Importing/uploading notebooks is also available using the "Import Note" Button and entering a URL with the location of the Zeppelin Notebook.
Notebook Storage in hadoop compatible file system repository
Notes may be stored in a Hadoop compatible file system such as HDFS, so that multiple Zeppelin instances can share the same notes. It supports all versions of Hadoop 2.x. If you use FileSystemNotebookRepo, then zeppelin.notebook.dir is the path on the Hadoop compatible file system, and you need to specify HADOOP_CONF_DIR in zeppelin-env.sh so that Zeppelin can find the right Hadoop configuration files. If your Hadoop cluster is kerberized, then you also need to specify zeppelin.server.kerberos.keytab and zeppelin.server.kerberos.principal.
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo</value>
<description>hadoop compatible file system notebook persistence layer implementation</description>
</property>
Zeppelin natively supports LDAP/PAM based authentication and user role mapping using Apache Shiro.
Zeppelin account information can be set up using Apache Shiro.
Example usage: on the Zeppelin node, set up a shiro.ini configuration file:
cp conf/shiro.ini.template conf/shiro.ini
In shiro.ini, we can create user/password accounts this way:
[users]
admin = password1, admin
user1 = password2, role1, role2
user2 = password3, role3
user3 = password4, role2
We can also set up groups and roles through this configuration file. For more info: http://zeppelin.apache.org/docs/0.8.2/setup/security/shiro_authentication.html
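To make the `[users]` format concrete (each line is `username = password, role1, role2, ...`), here is a minimal sketch that reads such a section with Python's standard `configparser`. It is only an illustration of the file layout, not how Shiro itself parses the file, and `parse_shiro_users` is a hypothetical helper name.

```python
import configparser


def parse_shiro_users(ini_text: str) -> dict:
    """Map each user in the [users] section to its password and role list."""
    # Only "=" is treated as a delimiter, and interpolation is disabled so
    # that passwords containing "%" are read literally.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep user names case-sensitive
    parser.read_string(ini_text)

    users = {}
    for name, value in parser.items("users"):
        fields = [part.strip() for part in value.split(",")]
        # First field is the password; any remaining fields are roles.
        users[name] = {"password": fields[0], "roles": fields[1:]}
    return users
```

Applied to the example above, `user1` maps to the roles `role1` and `role2`, and `admin` maps to the single role `admin`.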
OAuth may be possible with Zeppelin, but there doesn't seem to be an out-of-the-box solution.
Zeppelin uses Apache Shiro for authentication, and there are some blog posts that describe how to do SSO with OAuth and Shiro: https://developer.okta.com/blog/2020/05/11/java-shiro-oauth
There is also a tutorial describing how to integrate Zeppelin + OAuth using Apache Knox https://medium.com/data-collective/apache-zeppelin-oauth-integration-using-apache-knox-dea2362e3dda
From the above tutorial: OAuth integration is not natively available, but in the latest version KnoxSSO support is added. Using KnoxSSO we can integrate Zeppelin with any OAuth provider.
Apache Knox is an Application Gateway for interacting with the REST APIs and UIs of Apache Hadoop deployments. Knox supports OAuth authentication for hadoop applications using KnoxSSO service. KnoxSSO service is an integration service that provides a normalized SSO token for representing the authenticated user.
All REST APIs are available starting with the following endpoint: http://[zeppelin-server]:[zeppelin-port]/api
http://zeppelin.apache.org/docs/0.8.2/usage/rest_api/
REST API for running/managing Notebooks
http://zeppelin.apache.org/docs/0.8.2/usage/rest_api/notebook.html
As an example, if we want to run all paragraphs of a notebook, we can send a POST request to the following URL: http://[zeppelin-server]:[zeppelin-port]/api/notebook/job/[noteId]
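The request described above can be sketched with Python's standard library. This is a minimal sketch: the base URL and note ID are placeholders, the function names are hypothetical, and it assumes an unauthenticated Zeppelin instance (with Shiro enabled, a logged-in session would likely be required as well).

```python
import json
import urllib.request


def build_run_all_request(base_url: str, note_id: str) -> urllib.request.Request:
    """Build the POST request for /api/notebook/job/[noteId]."""
    url = f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"
    # An empty body is sent; the endpoint is addressed by the note ID alone.
    return urllib.request.Request(url, data=b"", method="POST")


def run_all_paragraphs(base_url: str, note_id: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_run_all_request(base_url, note_id)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and method before anything reaches the server.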
We can also use the REST API to restart an interpreter:
This PUT method restarts the given interpreter id. http://[zeppelin-server]:[zeppelin-port]/api/interpreter/setting/restart/[interpreter ID]
Sample JSON response {"status":"OK"}
Note: Occasionally Zeppelin/Spark will become unhealthy, and usually an interpreter restart will be the fix. This endpoint could be used to automate that restart whenever we discover that Zeppelin is unhealthy.
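A minimal sketch of how such an automated restart might look, using only the PUT endpoint and the sample response shown above. The function names are hypothetical, how "unhealthy" is detected is left open, and the sketch assumes no authentication is required.

```python
import json
import urllib.request


def build_restart_request(base_url: str, interpreter_id: str) -> urllib.request.Request:
    """Build the PUT request for /api/interpreter/setting/restart/[interpreter ID]."""
    url = f"{base_url.rstrip('/')}/api/interpreter/setting/restart/{interpreter_id}"
    return urllib.request.Request(url, method="PUT")


def restart_succeeded(response_body: str) -> bool:
    """Return True when the JSON response reports status OK."""
    return json.loads(response_body).get("status") == "OK"


def restart_interpreter(base_url: str, interpreter_id: str) -> bool:
    """Send the restart request and report whether Zeppelin answered OK."""
    request = build_restart_request(base_url, interpreter_id)
    with urllib.request.urlopen(request, timeout=60) as response:
        return restart_succeeded(response.read().decode("utf-8"))
```

A monitoring job could call `restart_interpreter` after its own health check fails and alert if the function returns False.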
The Zeppelin documentation also has a section describing the available options for handling multiple users: http://zeppelin.apache.org/docs/0.8.2/setup/basics/multi_user_support.html
This is available using Shiro, described above
http://zeppelin.apache.org/docs/0.8.2/setup/security/notebook_authorization.html
You can set Zeppelin notebook permissions in each notebook. Only notebook owners can change this configuration. Just click the Lock icon to open the permission setting page in your notebook.
Each Zeppelin notebook has 4 entities:
- Owners ( users or groups )
- Readers ( users or groups )
- Writers ( users or groups )
- Runners ( users or groups )
By default, the authorization rights allow other users to see a newly created note, meaning the workspace is public. This behavior can be controlled through either the ZEPPELIN_NOTEBOOK_PUBLIC variable in conf/zeppelin-env.sh or the zeppelin.notebook.public property in conf/zeppelin-site.xml. To make newly created notes appear only in your private workspace by default, set ZEPPELIN_NOTEBOOK_PUBLIC to false in your conf/zeppelin-env.sh as follows:
export ZEPPELIN_NOTEBOOK_PUBLIC="false"
or set zeppelin.notebook.public property to false in conf/zeppelin-site.xml as follows:
<property>
<name>zeppelin.notebook.public</name>
<value>false</value>
<description>Make notebook public by default when created, private otherwise</description>
</property>
Behind the scenes, when you create a new note, only the owners field is filled with the current user, leaving the readers, runners and writers fields empty. All notes with at least one empty authorization field are considered to be in the public workspace. Thus, when zeppelin.notebook.public (or the corresponding ZEPPELIN_NOTEBOOK_PUBLIC) is set to false, newly created notes have the readers, runners and writers fields filled with the current user, making the note appear in the private workspace.
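The rule described above can be modelled in a few lines. This is an illustrative sketch of the logic only, not Zeppelin's actual implementation; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NotePermissions:
    """The four authorization fields of a note, each a set of user names."""
    owners: set = field(default_factory=set)
    readers: set = field(default_factory=set)
    runners: set = field(default_factory=set)
    writers: set = field(default_factory=set)


def new_note_permissions(user: str, notebook_public: bool = True) -> NotePermissions:
    """Permissions of a freshly created note for the given setting."""
    permissions = NotePermissions(owners={user})
    if not notebook_public:
        # With zeppelin.notebook.public=false, every field gets the creator.
        permissions.readers = {user}
        permissions.runners = {user}
        permissions.writers = {user}
    return permissions


def is_public(permissions: NotePermissions) -> bool:
    """A note is public when at least one authorization field is empty."""
    fields = (
        permissions.owners,
        permissions.readers,
        permissions.runners,
        permissions.writers,
    )
    return any(len(entry) == 0 for entry in fields)
```

With the default setting, a new note has empty readers, runners and writers fields and is therefore public; with the setting disabled, all four fields contain the creator and the note is private.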
Zeppelin provides 3 different modes to run the interpreter process: shared, scoped and isolated.
Isolated mode runs a separate interpreter process for each note in the case of per note scope. So, each note has an absolutely isolated session. (But it is still possible to share objects via ResourcePool)
In Scoped mode, Zeppelin still runs a single interpreter JVM process but, in the case of per note scope, each note runs in its own dedicated session. (Note it is still possible to share objects between these notes via ResourcePool)
In Shared mode, a single JVM process and a single session serve all notes. As a result, note A can directly access variables (e.g. Python, Scala) created in other notes.
For more detailed explanation see here: http://zeppelin.apache.org/docs/0.8.2/usage/interpreter/interpreter_binding_mode.html
Zeppelin provides Spark integration with the following methods:
- Spark Standalone Mode
- Spark & Yarn
- Spark & Mesos
This can be configured by setting the "master" property appropriately:
- local[*] in local mode
- spark://master:7077 in standalone cluster
- yarn-client in Yarn client mode
- yarn-cluster in Yarn cluster mode
- mesos://host:5050 in Mesos cluster
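As a small illustration of the list above, the following sketch maps a "master" value back to the deployment mode it selects. It covers only the five forms listed here, and `describe_master` is a hypothetical helper, not part of Zeppelin or Spark.

```python
def describe_master(master: str) -> str:
    """Return the deployment mode implied by a Spark "master" value."""
    if master.startswith("local"):
        return "local mode"
    if master.startswith("spark://"):
        return "standalone cluster"
    if master == "yarn-client":
        return "Yarn client mode"
    if master == "yarn-cluster":
        return "Yarn cluster mode"
    if master.startswith("mesos://"):
        return "Mesos cluster"
    # Anything else is outside the forms listed in this page.
    raise ValueError(f"unrecognized master value: {master!r}")
```

A check like this could run before the interpreter setting is saved, so that a typo in the master value is caught early.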
One way that files can be uploaded in Zeppelin is using the %sh interpreter and wget:
%sh
wget -O <file_name> <url_file>
<file_name> = the local file name to save the download as
<url_file> = URL of the data file
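The same download could also be done from a Python paragraph instead of the shell. This is a minimal sketch using only the standard library; it assumes a Python interpreter is configured in Zeppelin, and `download_file` is a hypothetical helper name.

```python
import urllib.request


def download_file(url_file: str, file_name: str) -> str:
    """Download url_file and save it locally as file_name.

    Returns the local path that was written.
    """
    local_path, _headers = urllib.request.urlretrieve(url_file, file_name)
    return local_path
```

As with wget, the file ends up on the machine where the interpreter process runs, which is not necessarily the machine running the browser.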