Chunkifyr 📜🔪

Chunkifyr is a powerful and flexible Python library for splitting text into meaningful segments. Whether you're processing large documents for embedding, preparing chunks for a RAG application, or simply need to break text into manageable pieces, Chunkifyr provides a range of customizable chunking strategies.

Features ✨

Language Model Chunking: Utilize the context-awareness of language models to chunk text based on semantics, ensuring coherent segments that align with the text's meaning.
Syntactic Chunking: Break down text into syntactically meaningful segments, preserving grammatical structures.
Semantic Chunking: Group text into segments based on semantic meaning, providing contextually relevant chunks.
Multi File Support: Seamlessly chunk text from multiple files in different formats at once, including PDF, DOCX, TXT, and even webpages.
Customizable Settings: Easily adjust chunk sizes, overlap percentages, and more to fit your specific needs.
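
To illustrate how a chunk size and an overlap percentage interact, here is a minimal sketch in plain Python. It is not Chunkifyr's implementation; the function name `chunk_with_overlap` and its parameters are hypothetical and chosen only for this example.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 200, overlap_pct: float = 0.1) -> List[str]:
    """Split text into fixed-size character windows that overlap by a percentage.

    Hypothetical helper for illustration only; not part of Chunkifyr.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 1:
        raise ValueError("overlap_pct must be in the range [0, 1)")

    overlap = int(chunk_size * overlap_pct)
    step = chunk_size - overlap  # at least 1, because overlap < chunk_size

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap_pct=0.5))
```

With a chunk size of 4 and 50% overlap, each window starts 2 characters after the previous one, so neighbouring chunks share half of their content.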

Installation 🛠️

Install Chunkifyr via pip:

pip install chunkifyr

Note: Python 3.8+ is required.

Usage 🚀

Here’s a quick example using LMChunker to get you started:

from chunkifyr import LMChunker
from openai import OpenAI

# These credentials can be replaced with those of a local OpenAI-compatible server (llama_cpp, llamafile, ollama), if you are running one.
client = OpenAI(api_key="YOUR_API_KEY", base_url="DEPLOYMENT_URL")

chunker = LMChunker(model="gpt-3.5-turbo-0125", client=client)
chunks = chunker.from_files('path_to_your_text_file.txt')

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk.text}")
    print(f"Chunk {i} description: {chunk.meta.description}")
    print()

This example demonstrates how to use the LMChunker with an OpenAI client to break a text file into meaningful chunks. Each chunk also includes a description generated by the model, which can be used as metadata when embedding.
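
One way to carry the description along as metadata is to turn each chunk into a plain record before handing it to an embedding pipeline. The sketch below relies only on the `chunk.text` and `chunk.meta.description` attributes used in the example above; the helper name `chunks_to_records`, the record layout, and the stand-in chunk objects are assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def chunks_to_records(chunks: Iterable[Any], source: str) -> List[Dict[str, Any]]:
    """Build one record per chunk, keeping the description as metadata.

    Hypothetical helper; the record layout is an assumption, not a Chunkifyr format.
    """
    records = []
    for i, chunk in enumerate(chunks, 1):
        records.append({
            "id": f"{source}#{i}",
            "text": chunk.text,
            "metadata": {"description": chunk.meta.description, "source": source},
        })
    return records


# Stand-in objects that mimic the attributes shown above, so the sketch
# runs without an API key. Replace them with the chunks from a real chunker.
demo_chunks = [
    SimpleNamespace(text="First passage.", meta=SimpleNamespace(description="Introduction")),
    SimpleNamespace(text="Second passage.", meta=SimpleNamespace(description="Details")),
]

records = chunks_to_records(demo_chunks, "notes.txt")
for record in records:
    print(record["id"], record["metadata"]["description"])
```

Each record's `text` field can then be embedded, with `metadata` stored alongside the vector.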

Available Chunkers

LMChunker: Utilizes pre-trained language models for contextual chunking.
ClusterSemanticChunker: Generates globally optimal chunks, ensuring that each chunk contains semantically cohesive texts.
SimpleSemanticChunker: Groups similar splits together for basic semantic chunking.
SimpleSyntacticChunker: Simple syntactic chunking with a desired chunk size, overlap, and separator (very similar to the LangChain character splitter).
SemanticChunker: Groups text semantically using the Adjacent Sentence Clustering method with a configurable similarity threshold.
SyntacticChunker: Splits text into meaningful segments based on syntactic structures using hf_tokenizer.

More soon... (regex-based, etc.)
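
The idea behind adjacent sentence clustering with a similarity threshold can be sketched as follows. This is not the SemanticChunker implementation: the function names are hypothetical, and a toy bag-of-words cosine similarity stands in for the embedding model a real semantic chunker would typically use.

```python
import math
import re
from collections import Counter
from typing import List


def _bow(sentence: str) -> Counter:
    """Toy sentence representation: lowercase word counts (stand-in for an embedding)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def adjacent_sentence_chunks(text: str, threshold: float = 0.3) -> List[str]:
    """Group consecutive sentences while each is similar enough to the one before it."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_bow(prev), _bow(cur)) >= threshold:
            groups[-1].append(cur)  # similar to its neighbour: extend the current chunk
        else:
            groups.append([cur])    # similarity dropped below the threshold: start a new chunk
    return [" ".join(group) for group in groups]


sample = (
    "Cats chase mice. Cats chase birds too. "
    "Stock prices fell sharply. Stock prices may recover."
)
for chunk in adjacent_sentence_chunks(sample, threshold=0.3):
    print(chunk)
```

Raising the threshold produces more, smaller chunks; lowering it merges more neighbouring sentences together.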

Contributing 🤝

Contributions are welcome! If you have ideas for improving Chunkifyr or encounter any issues, feel free to submit a pull request or open an issue. There are many more chunkers that could be added, for example:

CSVChunker
MarkdownChunker
... Develop a chunker as you wish and create a PR :)

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.
