
Commit c52619b: WIP

snazy committed Nov 25, 2023 (1 parent: ffd26af)
Showing 19 changed files with 2,310 additions and 5,165 deletions.
.github/workflows/demos-docker-build.yaml (2 changes: 1 addition & 1 deletion)

@@ -34,7 +34,7 @@ jobs:
     strategy:
       max-parallel: 4
       matrix:
-        python-version: [3.7]
+        python-version: [3.10]

     steps:
       - uses: actions/checkout@v3
.github/workflows/notebooks.yaml (2 changes: 1 addition & 1 deletion)

@@ -34,7 +34,7 @@ jobs:
     strategy:
       max-parallel: 4
       matrix:
-        python-version: [3.7]
+        python-version: [3.10]

     steps:
       - uses: actions/checkout@v3
.gitignore (3 changes: 3 additions & 0 deletions)

@@ -38,6 +38,9 @@ venv/
 __pycache__/
 .pytest_cache

+# pyenv
+.python-version
+
 # Jetbrains IDEs
 /.idea
 *.iws
README.md (6 changes: 3 additions & 3 deletions)

@@ -21,7 +21,7 @@ Nessie version is set in Binder at `docker/binder/requirements_base.txt`. Curren

 ### Iceberg

-Currently we are using Iceberg `0.13.1` and it is specified in both iceberg notebooks as well as `docker/utils/__init__.py`
+Currently we are using Iceberg `1.4.2` and it is specified in both iceberg notebooks as well as `docker/utils/__init__.py`

 ### Spark

@@ -30,7 +30,7 @@ Only has to be updated in `docker/binder/requirements.txt`. Currently, Iceberg s

 ### Flink

-Flink version is set in Binder at `docker/binder/requirements_flink.txt`. Currently, we are using `1.13.6`.
+Flink version is set in Binder at `docker/binder/requirements_flink.txt`. Currently, we are using `1.17.1`.

 ### Hadoop

@@ -53,7 +53,7 @@ Of course, Binder just lets a user "simply start" a notebook via a simple "click

 ## Development
 For development, you will need to make sure to have the following installed:
-- Python 3.7+
+- Python 3.10+
 - pre-commit

 Regarding pre-commit, you will need to make sure it is installed through `pre-commit install` in order to install the hooks locally since this repo
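As a quick reference, a minimal sketch of that setup on a fresh clone (assuming `pre-commit` is installed from PyPI; pipx or a virtualenv works just as well):

```shell
# Install the pre-commit tool itself
pip install pre-commit

# Install this repo's hooks into .git/hooks, as the README requires
pre-commit install
```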
binder/Dockerfile (2 changes: 1 addition & 1 deletion)

@@ -2,7 +2,7 @@

 # Tag will be automatically generated through pre-commit hook if any changes
 # happened in the docker/ folder
-FROM ghcr.io/projectnessie/nessie-binder-demos:649ec80b8fa7d9666178380a33b2e645a52d5985
+FROM ghcr.io/projectnessie/nessie-binder-demos:8cbd35c0becb32d5c88f68476f6e8671a4b1f138

 # Create the necessary folders for the demo, this will be created and owned by {NB_USER}
 RUN mkdir -p notebooks && mkdir -p datasets
binder/README.md (8 changes: 4 additions & 4 deletions)

@@ -1,8 +1,8 @@
 ## Building binder locally

 ### Prerequisites
-You need to have Python 3.7+ installed.
+You need to have Python 3.11+ installed.
 We recommend using [pyenv](https://github.com/pyenv/pyenv) for managing your python environment(s).

 To build the binder image locally, first you need to install the `jupyter-repo2docker` dependency:
@@ -29,8 +29,8 @@ Run (or look into) the `build_run_local_docker.sh` script how to do this semi-au

 After those steps, the binder should be running on your local machine.
 Next, find the output similar to this:
 ```shell
 [C 13:38:25.199 NotebookApp]

     To access the notebook, open this file in a browser:
         file:///home/jovyan/.local/share/jupyter/runtime/nbserver-40-open.html
     Or copy and paste this URL:
docker/binder/postBuild (2 changes: 1 addition & 1 deletion)

@@ -26,7 +26,7 @@ python -m ipykernel install --name "flink-demo" --user
 python -c "import utils;utils._copy_all_hadoop_jars_to_pyflink()"
 conda deactivate

-python -c "import utils;utils.fetch_nessie()"
+python -c "import utils;utils.fetch_nessie_jar()"

 python -c "import utils;utils.fetch_spark()"

docker/binder/requirements.txt (6 changes: 3 additions & 3 deletions)

@@ -1,5 +1,5 @@
 -r requirements_base.txt
 findspark==2.0.1
-pandas==1.3.5
-pyhive[hive]==0.6.5
-pyspark==3.2.1
+pandas==1.5.3
+pyhive[hive_pure_sasl]==0.7.0
+pyspark==3.5.0
docker/binder/requirements_base.txt (2 changes: 1 addition & 1 deletion)

@@ -1 +1 @@
-pynessie==0.30.0
+pynessie==0.65.0
docker/binder/requirements_flink.txt (4 changes: 1 addition & 3 deletions)

@@ -1,4 +1,2 @@
 -r requirements_base.txt
-apache-flink==1.13.6
-# flink requires pandas<1.2.0 see https://github.com/apache/flink/blob/release-1.13.6/flink-python/setup.py#L313
-pandas==1.1.5
+apache-flink==1.17.1
docker/binder/start.hive (3 changes: 2 additions & 1 deletion)

@@ -39,7 +39,8 @@ fi
 export HIVE_HOME=$HIVE_PARENT_DIR/$HIVE_FOLDER_NAME

 # Create hive warehouse folder
-mkdir $HIVE_WAREHOUSE_DIR
+rm -rf $HIVE_WAREHOUSE_DIR
+mkdir -p $HIVE_WAREHOUSE_DIR

 # Copy the needed configs to Hive folder
 cp $RESOURCE_DIR/hive/config/hive-site.xml ${HIVE_HOME}/conf/
docker/utils/__init__.py (46 changes: 20 additions & 26 deletions)

@@ -1,4 +1,5 @@
 #!/usr/bin/env python
+# -*- coding: utf-8 -*-
 #
 # Copyright (C) 2020 Dremio
 #
@@ -18,7 +19,6 @@
 import os
 import shutil
 import site
-import stat
 import sysconfig
 import tarfile
 from typing import Optional
@@ -29,7 +29,7 @@
     import pyspark

     _SPARK_VERSION = pyspark.__version__
-    _SPARK_FILENAME = f"spark-{_SPARK_VERSION}-bin-hadoop3.2"
+    _SPARK_FILENAME = f"spark-{_SPARK_VERSION}-bin-hadoop3"
     _SPARK_URL = f"https://archive.apache.org/dist/spark/spark-{_SPARK_VERSION}/{_SPARK_FILENAME}.tgz"
 except ImportError:
     _SPARK_VERSION = None
@@ -40,22 +40,28 @@
 _HADOOP_FILENAME = f"hadoop-{_HADOOP_VERSION}"
 _HADOOP_URL = f"https://archive.apache.org/dist/hadoop/common/hadoop-{_HADOOP_VERSION}/{_HADOOP_FILENAME}.tar.gz"

-_FLINK_MAJOR_VERSION = "1.13"
+_FLINK_MAJOR_VERSION = "1.17"

-_ICEBERG_VERSION = "0.13.1"
-_ICEBERG_FLINK_FILENAME = f"iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}-{_ICEBERG_VERSION}.jar"
+_ICEBERG_VERSION = "1.4.2"
+_ICEBERG_FLINK_FILENAME = (
+    f"iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}-{_ICEBERG_VERSION}.jar"
+)
 _ICEBERG_FLINK_URL = f"https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-{_FLINK_MAJOR_VERSION}/{_ICEBERG_VERSION}/{_ICEBERG_FLINK_FILENAME}"
-_ICEBERG_HIVE_FILENAME = f"iceberg-hive-runtime-{_ICEBERG_VERSION}.jar"
-_ICEBERG_HIVE_URL = f"https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/{_ICEBERG_VERSION}/{_ICEBERG_HIVE_FILENAME}"
+_ICEBERG_HIVE_FILENAME = f"iceberg-hive3-{_ICEBERG_VERSION}.jar"
+_ICEBERG_HIVE_URL = f"https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive3/{_ICEBERG_VERSION}/{_ICEBERG_HIVE_FILENAME}"

-_HIVE_VERSION = "2.3.9"
+_HIVE_VERSION = "3.1.3"
 _HIVE_FILENAME = f"apache-hive-{_HIVE_VERSION}-bin"
 _HIVE_URL = (
     f"https://archive.apache.org/dist/hive/hive-{_HIVE_VERSION}/{_HIVE_FILENAME}.tar.gz"
 )

+_NESSIE_VERSION = "0.74.0"
+

-def _link_file_into_dir(source_file: str, target_dir: str, replace_if_exists=True) -> None:
+def _link_file_into_dir(
+    source_file: str, target_dir: str, replace_if_exists=True
+) -> None:
     assert os.path.isfile(source_file)
     assert os.path.isdir(target_dir)

@@ -75,7 +81,7 @@ def _link_file_into_dir(source_file: str, target_dir: str, replace_if_exists=Tru
         os.link(source_file, target_file)
         assert os.path.isfile(target_file), (source_file, target_file)

-    action = 'replaced' if replaced else 'created'
+    action = "replaced" if replaced else "created"
     print(f"Link target was {action}: {target_file} (source: {source_file})")

@@ -112,7 +118,9 @@ def _copy_all_hadoop_jars_to_pyflink() -> None:
     pyflink_lib_dir = _find_pyflink_lib_dir()
     for _jar_count, jar in enumerate(_jar_files()):
         _link_file_into_dir(jar, pyflink_lib_dir)
-    print(f"Linked {_jar_count} HADOOP jar files into the pyflink lib dir at location {pyflink_lib_dir}")
+    print(
+        f"Linked {_jar_count} HADOOP jar files into the pyflink lib dir at location {pyflink_lib_dir}"
+    )


 def _find_pyflink_lib_dir() -> Optional[str]:
@@ -139,16 +147,6 @@ def _download_file(filename: str, url: str) -> None:
         f.write(r.content)


-def fetch_nessie() -> str:
-    """Download nessie executable."""
-    runner = "nessie-quarkus-runner"
-
-    url = _get_base_nessie_url()
-    _download_file(runner, url)
-    os.chmod(runner, os.stat(runner).st_mode | stat.S_IXUSR)
-    return runner
-
-
 def fetch_nessie_jar() -> str:
     """Download nessie Jar in order to run the tests in Mac"""
     runner = "nessie-quarkus-runner.jar"
@@ -159,12 +157,8 @@


 def _get_base_nessie_url() -> str:
-    import pynessie
-
-    version = pynessie.__version__
-
     return "https://github.com/projectnessie/nessie/releases/download/nessie-{}/nessie-quarkus-{}-runner".format(
-        version, version
+        _NESSIE_VERSION, _NESSIE_VERSION
     )
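For illustration, the URL that the pinned `_get_base_nessie_url()` now produces can be exercised by hand; a sketch, assuming `_NESSIE_VERSION = "0.74.0"` as set above (the pattern is taken verbatim from the function, so no `.jar` suffix appears here):

```shell
# Hypothetical manual fetch using the exact URL pattern from _get_base_nessie_url
curl -L -o nessie-quarkus-runner \
  "https://github.com/projectnessie/nessie/releases/download/nessie-0.74.0/nessie-quarkus-0.74.0-runner"
```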