GitHub - qiyangduan/schemaindex: SchemaIndex is designed for data scientists to index and search metadata more efficiently.

Overview

SchemaIndex is designed for data scientists to find data more efficiently. It can index the tables and files known to the user.

With schemaindex, you can:

Create a data source (e.g. MySQL, Oracle) by registering its connection information.
Reflect the data source and index the metadata.
Search for all tables/entities in those data sources by their names.
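
To illustrate the last step, here is a minimal sketch of a name-based search over already-indexed metadata. It is not schemaindex's actual implementation: the `Entity` record, the `search_by_name` function, and the sample entries are all hypothetical and exist only to show the idea of matching tables and files by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One indexed table or file (hypothetical record layout)."""
    data_source: str
    name: str


def search_by_name(index, query):
    """Return entities whose name contains the query, ignoring case."""
    needle = query.lower()
    return [e for e in index if needle in e.name.lower()]


# Made-up sample index covering a database table and an HDFS file.
index = [
    Entity("mysql_sales", "customer_orders"),
    Entity("hdfs_lake", "/data/raw/orders_2017.csv"),
    Entity("sqlite_local", "employees"),
]

for hit in search_by_name(index, "ORDERS"):
    print(hit.data_source, hit.name)
```

A plain substring match keeps the sketch short; a real index would more likely use a full-text search engine.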

Supported Data Sources

I plan to support at least the top 10 most popular databases in db-ranking. This table tracks the progress:

Data Source | pip install | Cloud | Notes |
:--- | :--- | :---: | :--- |
Oracle | requires: 1. pip install cx_oracle, 2. install Oracle Instant Client | OOTB | |
MySQL | requires: 1. pip install pymysql | OOTB | |
MS SQL Server | requires: 1. conda install pymssql | OOTB | |
SQLite | OOTB | OOTB | |
HDFS | OOTB | OOTB | |
HDFS_inotify | OOTB | OOTB | |

Data Sources to Support on Roadmap

HDP (Hive)

Installation

On Linux

Standard pip should be able to install schemaindex:

$ pip install schemaindex

How to use

Basic Usage

To start the schemaindex server, please run this command:

$ schemaindex runserver

The following is a sample output:

(py3env1) duan:py3env1$ schemaindex runserver
Server started, please visit : http://localhost:8088/
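
If you want to confirm from a script that the server is answering, a small check like the following may help. It is only a sketch: the `is_server_up` helper is hypothetical, and the default URL is simply the address printed in the sample output above.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://localhost:8088/", timeout=2.0):
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    print("reachable" if is_server_up() else "not reachable")
```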

The runserver command should boot up a web server and also open a browser for you. In the browser, click "datasources" and then click "create ..." to register your own data source. For example, to register a new HDFS data source, you can enter information like that in the following screenshot:

The next step is to reflect the data source and extract all of its metadata. You can do so by clicking the "Reflect Now!" button, or by checking the "Reflect Data Source Immediately" box during data source creation.

If both previous steps succeed, you should be able to search for the files using the "search" box, which appears on the "overview" and "search" pages, as in the following screenshot:

Work with HDFS Index

While creating a data source, you can select the 'hdfsindex' plugin. This plugin is based on the hdfscli library (pip install hdfs). You need to provide the following parameters:

HDFS Web URL: sometimes also known as the Namenode UI. Note: Kerberos authentication is not supported. If you need it, please raise a ticket on GitHub.
HDFS Native URL: Usually you can find this link after you have opened the Namenode UI / web URL. It should start with hdfs://, for example hdfs://localhost:9000 (or port 8020).
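
The hdfscli library talks to the WebHDFS REST API served under the HDFS Web URL. As a rough illustration of what a directory listing looks like at that level, the sketch below builds a LISTSTATUS request URL and parses a response using only the standard library. The helper names, the `http://localhost:50070` address, and the sample payload are assumptions for illustration, not values taken from schemaindex.

```python
import json
from urllib.parse import quote


def liststatus_url(web_url, path):
    """Build the WebHDFS LISTSTATUS URL for `path` under the given web URL."""
    return f"{web_url.rstrip('/')}/webhdfs/v1{quote(path)}?op=LISTSTATUS"


def parse_liststatus(payload, parent):
    """Turn a LISTSTATUS JSON response into (full_path, type) tuples."""
    statuses = json.loads(payload)["FileStatuses"]["FileStatus"]
    base = parent.rstrip("/")
    return [(f"{base}/{s['pathSuffix']}", s["type"]) for s in statuses]


# Hand-written sample response in the WebHDFS LISTSTATUS shape.
sample = json.dumps({
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "events.csv", "type": "FILE"},
            {"pathSuffix": "logs", "type": "DIRECTORY"},
        ]
    }
})

print(liststatus_url("http://localhost:50070", "/data"))
print(parse_liststatus(sample, "/data"))
```

Indexing an HDFS tree amounts to repeating this listing for every directory found, which is the kind of work a reflect step has to do.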

If you check "Real time synchronization:" and you have reflected the HDFS data source, schemaindex will start a background Java process to capture all HDFS changes and update the index in real time. In the background, you should be able to see a process similar to "java ... HdfsINotify2Restful". If you do not see this process, try restarting the schemaindex server, or look at the logs in $SCHEMAINDEX/log.

Work with Databases

By default, schemaindex comes with a predefined plugin, sqlalchemyindex, to extract metadata from mainstream databases. This reflect engine is based on the Python library SQLAlchemy, which works with many databases, including MySQL and SQLite. For MySQL to work, you need to install pymysql (Python 3) or mysql-python (Python 2) in advance.
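
To give a feel for what "reflecting" a database produces, the sketch below extracts table and column metadata from an in-memory SQLite database. Note the substitution: the plugin itself relies on SQLAlchemy, whereas this example uses the standard-library `sqlite3` module as a dependency-free stand-in. The `reflect_sqlite` function and the sample tables are hypothetical.

```python
import sqlite3


def reflect_sqlite(conn):
    """Return {table_name: [(column_name, declared_type), ...]}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    metadata = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [(c[1], c[2]) for c in cols]
    return metadata


# Two made-up tables to reflect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE departments (dept_id INTEGER, title TEXT)")

for table, columns in reflect_sqlite(conn).items():
    print(table, columns)
```

The resulting table and column names are the kind of metadata an index can then be searched by.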

How to start a SchemaIndex Server

All the plugins are located in $SCHEMAINDEX/plugin. Currently only HDFS and SQLALCHEMY are implemented. If you want to add more plugins, you can put the plugin into this folder and run this command:

$ schemaindex reload plugin

The following is a sample output:

(py3env1) duan:py3env1$ schemaindex reload plugin
Plugins are reloaded.
Reflect Plugin Name:                     Path:
hdfsindex                                /home/duan/virenv/py3env1/local/lib/python2.7/site-packages/schemaindex/plugin/hdfsindex
sqlalchemy                               /home/duan/virenv/py3env1/local/lib/python2.7/site-packages/schemaindex/plugin/sqlalchemyindex
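
The listing above suggests that each plugin lives in its own subfolder of the plugin directory. Under that assumption, plugin discovery could be sketched as a simple folder scan. This is not the project's actual reload logic: `discover_plugins` is a hypothetical helper, and the temporary directory only stands in for $SCHEMAINDEX/plugin.

```python
import os
import tempfile


def discover_plugins(plugin_dir):
    """Map each immediate subfolder name of `plugin_dir` to its absolute path."""
    found = {}
    for entry in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, entry)
        # Skip plain files and hidden or private folders such as __pycache__.
        if os.path.isdir(path) and not entry.startswith((".", "_")):
            found[entry] = os.path.abspath(path)
    return found


# Demo against a throwaway directory standing in for the plugin folder.
with tempfile.TemporaryDirectory() as plugin_dir:
    for name in ("hdfsindex", "sqlalchemyindex", "__pycache__"):
        os.mkdir(os.path.join(plugin_dir, name))
    for name, path in discover_plugins(plugin_dir).items():
        print(f"{name:<40} {path}")
```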

Reference

Those questions explain why I created this software:
