Skip to content

Commit

Permalink
docs: initial readme
Browse files Browse the repository at this point in the history
  • Loading branch information
adrienaury committed Mar 25, 2024
1 parent 4526d3b commit fff4980
Showing 1 changed file with 73 additions and 2 deletions.
75 changes: 73 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,81 @@
# SILO

TODO
SILO (Sparse Input Linked Output) is an open-source command-line interface (CLI) tool designed for processing data in JSONLine format. It provides functionality to ingest data from standard input (stdin) and isolate entites (which are groups of related values) into a file, allowing users to create a referential of all entities discovered within the JSONLine data.

SILO can be used in addition to LINO and PIMO tools to generate consistency sources.

Here is an short example where SILO can be useful :

- TableA contains data about clients, and have these direct identifiers as columns : ID_CLIENT, EMAIL_CLIENT, ACCOUNT_NUMBER
- TableB contains data about clients, and have these direct identifiers as columns : ID_CLIENT, EMAIL_CLIENT

Unfortunately, available dataset contains a lot of duplication and null values.

**`TableA`**

| ID_CLIENT | EMAIL_CLIENT | ACCOUNT_NUMBER |
| --------- | ------------------- | -------------- |
| 0001 | [email protected] | |
| | | C01 |

**`TableB`**

| ACCOUNT_NUMBER | EMAIL_CLIENT |
| -------------- | ------------------- |
| C01 | [email protected] |

SILO will be able to generate the following referential.

**`Output of SILO`**

| UUID | ID_CLIENT | EMAIL_CLIENT | ACCOUNT_NUMBER |
| ------------------------------------ | --------- | ------------------- | -------------- |
| 79cc287b-3640-49c1-9e6a-86cff87cce41 | 0001 | [email protected] | C01 |

By leveraging SILO's capabilities, users can efficiently identify and link related records across disparate datasets, even in cases where direct identifiers are missing or duplicated.

## Installation

To install SILO, follow these steps:

1. Download the released tar.gz corresponding to your operating system
2. Extract the tar.gz
3. Optionnaly, move the `silo` binary to a shared path like `/usr/bin/silo`

## Usage

TODO
SILO provides two main commands:

### silo scan

The silo scan command is used to ingest data from stdin in JSONLine format, persisted on disk for future reference. Here's how to use it:

```console
$ silo scan my-silo < input.jsonl
⣾ Scanned 5 rows, found 15 links (4084 row/s) [0s]
```

Analysis data is persisted on disk on the `my-silo` path relative to the current directory.

### silo dump

The silo dump command is used to dump each connected entity into a file. This allows users to create a referential of all entities discovered within the JSONLine data. Here's how to use it:

```console
$ silo dump my-silo
{"uuid":"19bef352-ed87-4de8-a4ea-65f1d7db9ced","id":"ID1","key":2}
{"uuid":"19bef352-ed87-4de8-a4ea-65f1d7db9ced","id":"ID2","key":"2"}
{"uuid":"19bef352-ed87-4de8-a4ea-65f1d7db9ced","id":"ID3","key":2.2}
{"uuid":"19bef352-ed87-4de8-a4ea-65f1d7db9ced","id":"ID4","key":"00002"}
{"uuid":"60d7e970-ca56-410f-86f3-a6c1e67f032a","id":"ID2","key":"1"}
{"uuid":"60d7e970-ca56-410f-86f3-a6c1e67f032a","id":"ID4","key":"00001"}
{"uuid":"60d7e970-ca56-410f-86f3-a6c1e67f032a","id":"ID3","key":1.1}
{"uuid":"60d7e970-ca56-410f-86f3-a6c1e67f032a","id":"ID1","key":1}
{"uuid":"a628e8b5-69a7-4707-8f81-da2200ae1e1f","id":"ID2","key":"3"}
{"uuid":"a628e8b5-69a7-4707-8f81-da2200ae1e1f","id":"ID3","key":3.3}
{"uuid":"a628e8b5-69a7-4707-8f81-da2200ae1e1f","id":"ID4","key":"00003"}
{"uuid":"a628e8b5-69a7-4707-8f81-da2200ae1e1f","id":"ID1","key":3}
```

## Contributing

Expand Down

0 comments on commit fff4980

Please sign in to comment.