pinecone things

LuciferUchiha · Jul 26, 2024 · dcb96c3
1 parent b1933e8
Showing 7 changed files with 194 additions and 1,674 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,4 +6,6 @@ node_modules
 .venv
 venv
 
-*/**/__pycache__
+*/**/__pycache__
+
+*/**/.env
diff --git a/chunker/chunks.jsonl b/chunker/chunks.jsonl
diff --git a/chunker/main.py b/chunker/main.py
diff --git a/chunker/requirements.txt b/chunker/requirements.txt
diff --git a/pinecone/chunks.jsonl b/pinecone/chunks.jsonl
@@ -0,0 +1,10 @@
+{"id": "../pages/digitalGarden/index.mdx#1", "metadata": {"Header 1": "My Digital Garden", "Header 2": "What is a Digital Garden?", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#1", "page_content": "A digital garden is a mix between a notebook and a blog, it is a place to share thoughts and cultivate them into a garden.\nIt also allows me to have a place where I can store my notes/summaries/tutorials for my studies.  \nThe main difference to a blog is that a blog has articles and publication dates and never changes after it has been\npublished, whereas a digital garden is a place where the written content can be continuously edited and refined. The\nnotes are also very free flowing they can span from just a short cheatsheet to a full set of notes on an entire subject\nwhere you go into every nitty-gritty detail.  \nAnother key difference is the navigation. A blog is usually read in chronological order but a digital garden can be read\nin any order you want and uses lots of internal links to connect all the notes into a Network (although this can be\nquite hard to diligently do).  \nIf you are interested in learning more about digital gardens I can recommend the following\n[article by Maggie Appleton](https://maggieappleton.com/garden-history)."}}
+{"id": "../pages/digitalGarden/index.mdx#2", "metadata": {"Header 1": "My Digital Garden", "Header 2": "How is my Garden Built?", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#2", "page_content": "The current iteration of my digital garden is built using [Nextra](https://nextra.site/). Nextra is a static site\ngenerator that is built on top of Next.js and MDX. This allows me to write my notes in markdown and also use the MDX\nformat to write JSX in my markdown files. These markdown files are then converted into static HTML files using Next.js\nand can be hosted on any static site hosting service such as [Vercel](https://vercel.com/)."}}
+{"id": "../pages/digitalGarden/index.mdx#3", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#3", "page_content": "In this section I briefly go over some of the features that are supported by my digital garden and how to use them."}}
+{"id": "../pages/digitalGarden/index.mdx#4", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "Markdown", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#4", "page_content": "Markdown is supported out of the box. Anything that is supported by markdown can be used in the notes. This includes but\nis not limited to:  \n- Headers\n- Lists\n- Links\n- Images\n- Code Blocks\n- Tables\n- Blockquotes  \nFor a full list of markdown features check out the [Markdown Guide](https://www.markdownguide.org/)."}}
+{"id": "../pages/digitalGarden/index.mdx#5", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "MDX", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#5", "page_content": "In addition to the normal markdown format, Nextra also supports the MDX format which allows you to write JSX, i.e. react code in a\nmarkdown file. To find out more about MDX check out the [official MDX documentation](https://mdxjs.com/).  \n#### Admonitions / Callouts  \nAdmonitions aren't included in standard markdown but have become very popular. Recently GitHub has also added support for\nadmonitions in markdown FileSystem, however they call them alerts.  \nAdmonitions are very useful to highlight certain text and add a category to the text. I have added a custom component that\nbuilds on nextra's callouts to be able to add custom callout types. To use callouts in a MDX file you can use the following syntax:  \n```\n<Callout type=\"warning\">\nThis Is a big scary warning.\n</Callout>\n```  \nRenders to:  \n<Callout type=\"warning\">\nThis Is a big scary warning.\n</Callout>  \nYou can also change the title of the banner:  \n```\n<Callout type=\"info\" title=\"The following types are supported\">\ninfo, warning, error, example, todo\n</Callout>\n```  \n<Callout type=\"info\" title=\"The following types are supported\">\ninfo, warning, error, example, todo\n</Callout>  \nThe default callout type uses the websites primary color, a rocket icon and has no title:  \n<Callout>\nThis is a default callout.\n</Callout>"}}
+{"id": "../pages/digitalGarden/index.mdx#6", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "Jupyter Notebooks", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#6", "page_content": "<Callout type=\"todo\">\nTODO add how the hound works and how to use it.\n</Callout>"}}
+{"id": "../pages/digitalGarden/index.mdx#7", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "LaTeX", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#7", "page_content": "It has recently become very popular to write LaTeX equations in markdown. Nextra supports this by using [KaTeX](https://katex.org/).\nYou can render LaTeX content either inline between `$\\LaTeX$` $\\LaTeX$ or as a block between `$$I = \\int_0^{2\\pi} \\sin(x)\\,dx$$`:  \n$$\nI = \\int_0^{2\\pi} \\sin(x)\\,dx\n$$  \nAnnoyingly Jupyter Notebooks use MathJax to render LaTeX content in the same way instead of KaTeX. This means that KaTeX\nsupports some things and MathJax supports other things. Importantly however is that the Jupyter Notebooks get converted\nto Markdown and therefore in the end it will only be rendered in KaTeX.  \nTherefore, if something is written that is supported in MathJax but not in KaTeX it might look okay but in the end,\nit will not be rendered by KaTeX. This leads to [my LaTeX Notation Guideline](./maths/latexGuidelines) to avoid\nconflicts whilst still keeping nice Formulas.  \nYou can see what is supported by KaTeX [here,](https://katex.org/docs/supported.html) and you can see what is supported\nby MathJax [here](https://docs.mathjax.org/en/latest/input/tex/macros/index.html)."}}
+{"id": "../pages/digitalGarden/index.mdx#8", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "PlantUML", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#8", "page_content": "If you ever need to create diagrams and especially UML diagrams, PlantUML is the way to go. I started with Mermaid\nto create UML diagrams but swapped to PlantUML for the additional features and the ability to create custom themes\n(so everything can be minimalist and purple :D).  \nTo render PlantUML diagrams the [Remark plugin Simple PlantUML](https://github.com/akebifiky/remark-simple-plantuml) is\nused which uses the official PlantUML server to generate an image and then adds it.  \nAn Example can be seen below, on the [official website](https://plantuml.com/) and also on [REAL WORLD PlantUML](https://real-world-plantuml.com/?type=class).  \n```plantuml\n@startuml\n\ninterface Command {\nexecute()\nundo()\n}\nclass Invoker{\nsetCommand()\n}\nclass Client\nclass Receiver{\naction()\n}\nclass ConcreteCommand{\nexecute()\nundo()\n}\n\nCommand <|-down- ConcreteCommand\nClient -right-> Receiver\nClient --> ConcreteCommand\nInvoker o-right-> Command\nReceiver <-left- ConcreteCommand\n\n@enduml\n```  \nTo use my custom theme you can use the following line at the beginning of the PlantUML file:  \n```\n@startuml\n!theme purplerain from http://raw.githubusercontent.com/LuciferUchiha/georgerowlands.ch/main\n\n...\n\n@enduml\n```  \nHowever, it seems like when using a custom theme There can not be more then one per page? My custom theme also has some processes built in for simple text coloring for example in cases of success, failure etc.  \n```plantuml\n@startuml\n!theme purplerain from http://raw.githubusercontent.com/LuciferUchiha/georgerowlands.ch/main\n\nBob -> Alice :  normal\nBob <- Alice :  $success(\"success: Hi Bob\")\nBob -x Alice :  $failure(\"failure\")\nBob ->> Alice : $warning(\"warning\")\nBob ->> Alice : $info(\"finished\")\n\n@enduml\n```"}}
+{"id": "../pages/digitalGarden/index.mdx#9", "metadata": {"Header 1": "My Digital Garden", "Header 2": "How can I Contribute?", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#9", "page_content": "Do you enjoy the content and want to contribute to the garden by adding some new plants or watering the existing ones?\nThen feel free to make a pull request. There are however some rules to keep in mind before adding or changing content.  \n- Markdown filenames and folders are written in camelCase.\n- Titles should follow the\n[IEEE Editorial Style Manual](https://www.ieee.org/content/dam/ieee-org/ieee/web/org/conferences/style_references_manual.pdf).\nThey should also be added to the markdown file and specified in the `_meta.json` which maps files to titles and is also\nresponsible for the ordering.\n- LaTeX should conform with my notation and guideline, if something is not defined there you can of course add it."}}
+{"id": "../pages/digitalGarden/cs/algorithmsDataStructures/analysisOfAlgorithms.mdx#1", "metadata": {"Header 1": "Analysis of Algorithms", "path": "../pages/digitalGarden/cs/algorithmsDataStructures/analysisOfAlgorithms.mdx", "id": "../pages/digitalGarden/cs/algorithmsDataStructures/analysisOfAlgorithms.mdx#1", "page_content": "Asymptotic Complexity / Analysis of Algorithms  \nThe master method and how to calculate it and stuff, go back to algd1, MIT 6.006 and Algorithms Illuminated will help.  \nTelescoping? How to get to recurrance relation and then asymptotic complexity."}}
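Review note: each line of `pinecone/chunks.jsonl` is one JSON object with an `id` of the form `<path>#<position>` and a `metadata` dict that carries the header trail, the path, and the `page_content` (the vector is dropped before saving). A minimal sketch of reading such records back with only the standard library, assuming that layout; the helper names `parse_chunk_lines` and `header_trail` and the shortened sample records are illustrative and not part of the commit:

```python
import json


def parse_chunk_lines(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def header_trail(record):
    """Join the 'Header 1'..'Header 3' metadata values that are present."""
    metadata = record["metadata"]
    keys = ("Header 1", "Header 2", "Header 3")
    return " > ".join(metadata[key] for key in keys if key in metadata)


# Shortened stand-ins for two records (page_content abbreviated for the demo).
sample_lines = [
    '{"id": "../pages/digitalGarden/index.mdx#3", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#3", "page_content": "In this section I briefly go over some of the features."}}',
    '{"id": "../pages/digitalGarden/index.mdx#4", "metadata": {"Header 1": "My Digital Garden", "Header 2": "The Features", "Header 3": "Markdown", "path": "../pages/digitalGarden/index.mdx", "id": "../pages/digitalGarden/index.mdx#4", "page_content": "Markdown is supported out of the box."}}',
]

records = parse_chunk_lines(sample_lines)
for record in records:
    # The id is "<path>#<position>"; split on the last '#'.
    path, _, position = record["id"].rpartition("#")
    print(position, header_trail(record), len(record["metadata"]["page_content"]))
```

To read the real file, the same helper could be fed an open file handle, for example `parse_chunk_lines(open("chunks.jsonl", encoding="utf8"))`.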
diff --git a/pinecone/main.py b/pinecone/main.py
@@ -0,0 +1,176 @@
+import os
+import jsonlines
+
+from langchain_core.documents import Document
+from langchain_text_splitters import MarkdownHeaderTextSplitter
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from pinecone.grpc import PineconeGRPC as Pinecone
+from pinecone import ServerlessSpec
+from tqdm import tqdm
+from pydantic import BaseModel
+from openai import OpenAI
+from dotenv import load_dotenv
+
+load_dotenv()
+pinecone_key = os.getenv("PINECONE_KEY")
+pc = Pinecone(api_key=pinecone_key)
+openai_key = os.getenv("OPENAI_KEY")
+client = OpenAI(api_key=openai_key)
+
+
+class PineconeDocument(BaseModel):
+    id: str
+    values: list[float]
+    metadata: dict = {}
+
+    @classmethod
+    def from_document_with_embedding(cls, doc: Document, embedding: list[float]):
+        # copy the metadata and add the page_content, leaving the input document untouched
+        metadata = {**doc.metadata, "page_content": doc.page_content}
+        return cls(id=doc.metadata["id"], values=embedding, metadata=metadata)
+
+    def dict(self):
+        return {
+            "id": self.id,
+            "values": self.values,
+            "metadata": self.metadata
+        }
+
+
+def save_documents_to_jsonl(documents: list[PineconeDocument], file_path):
+    with jsonlines.open(file_path, mode="w") as writer:
+        for doc in documents:
+            doc_dict = doc.dict()
+            # to save space, remove the vector
+            del doc_dict["values"]
+            writer.write(doc_dict)
+
+
+def get_documents() -> list[Document]:
+    pages_path = "../pages/"
+
+    excluded_files = ["../pages/index.mdx", "../pages/_app.mdx"]
+
+    documents = []
+    for root, dirs, files in os.walk(pages_path):
+        for file in files:
+            # join in unix style
+            file_path = os.path.join(root, file).replace("\\", "/")
+            if file_path in excluded_files:
+                continue
+            if file_path.endswith(".mdx"):
+                with open(file_path, "r", encoding="utf8") as f:
+                    file_content = f.read()
+                # parse to langchain document
+                doc = Document(page_content=file_content,
+                               metadata={"path": file_path})
+                documents.append(doc)
+
+    return documents
+
+
+def preprocess(document: Document) -> Document:
+    # no preprocessing yet; let's see how well it works if we do nothing
+    return document
+
+
+def chunk_pages(documents: list[Document]) -> list[Document]:
+    header_level = 3
+    chunk_size = 2048
+    chunk_overlap = 256
+
+    headers_to_split_on = [("#" * level, f"Header {level}")
+                           for level in range(1, header_level + 1)]
+
+    markdown_splitter = MarkdownHeaderTextSplitter(
+        headers_to_split_on=headers_to_split_on)
+    text_splitter = RecursiveCharacterTextSplitter(
+        chunk_size=chunk_size, chunk_overlap=chunk_overlap
+    )
+
+    chunks = []
+    for doc in documents:
+        doc_id = doc.metadata["path"]
+        doc_position = 0
+        doc_splits = markdown_splitter.split_text(doc.page_content)
+        for split in doc_splits:
+            # append metadata without overwriting
+            split.metadata.update(doc.metadata)
+
+        # if there is no header 1 then it is just imports so the split can be removed
+        doc_splits = [split for split in doc_splits if split.metadata.get(
+            "Header 1") is not None]
+
+        for split in doc_splits:
+            split_chunks_content = text_splitter.split_text(split.page_content)
+            for chunk_content in split_chunks_content:
+                chunk = Document(page_content=chunk_content,
+                                 metadata=split.metadata.copy())
+                doc_position += 1
+                chunk.metadata["id"] = f"{doc_id}#{doc_position}"
+                chunks.append(chunk)
+
+    print(f"Split to {len(chunks)} chunks")
+    return chunks
+
+
+def setup_index(index_name: str):
+    if index_name not in pc.list_indexes().names():
+        pc.create_index(
+            name=index_name,
+            dimension=1536,
+            metric="cosine",
+            spec=ServerlessSpec(
+                cloud='aws',
+                region='us-east-1'
+            )
+        )
+
+    index = pc.Index(index_name)
+    return index
+
+
+def upsert_chunks_to_index(index, chunks: list[PineconeDocument], batch_size: int = 1000):
+    for i in tqdm(range(0, len(chunks), batch_size)):
+        batch = chunks[i:i + batch_size]
+        index.upsert(vectors=[doc.dict() for doc in batch])
+
+
+def embed_documents(documents: list[Document], batch_size: int = 256, model: str = "text-embedding-3-small") -> list[tuple[Document, list[float]]]:
+    embedded_documents = []
+    for i in tqdm(range(0, len(documents), batch_size)):
+        batch = documents[i:i + batch_size]
+        batch_texts = [doc.page_content for doc in batch]
+        response = client.embeddings.create(
+            input=batch_texts, model=model).data
+        embeddings = [result.embedding for result in response]
+        embedded_documents.extend(zip(batch, embeddings))
+
+    return embedded_documents
+
+
+if __name__ == "__main__":
+    index = setup_index(index_name="digital-garden")
+
+    documents = get_documents()
+    print(f"Found {len(documents)} documents")
+    processed_documents = [preprocess(document) for document in documents]
+
+    chunks = chunk_pages(processed_documents)
+    chunks = chunks[:10]  # only embed and upsert the first 10 chunks for now
+
+    embedded_chunks = embed_documents(chunks)
+    pinecone_chunks = [PineconeDocument.from_document_with_embedding(
+        doc, embedding) for doc, embedding in embedded_chunks]
+
+    upsert_chunks_to_index(index, pinecone_chunks)
+
+    save_documents_to_jsonl(pinecone_chunks, "./chunks.jsonl")
+    print("Chunks saved to ./chunks.jsonl")
+
+    # just for fun, how many words have I roughly written
+    total_chars = sum([len(chunk.page_content) for chunk in chunks])
+    total_words = sum([len(chunk.page_content.split()) for chunk in chunks])
+    total_book_pages = total_chars // 2000
+    print(
+        f"You have written about {total_chars} characters, {total_words} words, which is about {total_book_pages} book pages")
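Review note: in `chunk_pages`, the position counter is reset once per document and incremented for every chunk, so ids run `path#1`, `path#2`, ... across all header sections of a page rather than restarting at each header. A minimal stdlib-only sketch of that numbering, assuming a naive fixed-size splitter (`split_fixed`, hypothetical) as a stand-in for `RecursiveCharacterTextSplitter` and plain strings as a stand-in for the header splits; it does not reproduce the real splitters' behavior such as overlap:

```python
def split_fixed(text, chunk_size):
    """Naive stand-in for a character splitter: fixed-size slices, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def assign_chunk_ids(path, sections, chunk_size):
    """Number chunks 1..n across all sections of one document."""
    chunks = []
    position = 0  # reset once per document, not per section
    for section in sections:
        for piece in split_fixed(section, chunk_size):
            position += 1
            chunks.append({"id": f"{path}#{position}", "page_content": piece})
    return chunks


# Three made-up sections of 5, 12 and 3 characters, split at 5 characters.
sections = ["a" * 5, "b" * 12, "c" * 3]
chunks = assign_chunk_ids("../pages/example.mdx", sections, chunk_size=5)
for chunk in chunks:
    print(chunk["id"], repr(chunk["page_content"]))
```

Because the id depends on the chunk's position, editing an early section of a page can shift the ids of every later chunk on that page; re-running the upsert would then overwrite those ids with different content.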
diff --git a/pinecone/requirements.txt b/pinecone/requirements.txt
@@ -0,0 +1,5 @@
+python-dotenv==1.0.1
+jsonlines==4.0.0
+langchain==0.2.11
+openai==1.37.1
+pinecone-client[grpc]==5.0.0
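Review note: `python-dotenv` is what lets `main.py` pick up `PINECONE_KEY` and `OPENAI_KEY` from the `.env` file that this commit adds to `.gitignore`. `os.getenv` returns `None` for an unset variable, so a missing key is reported by the client libraries rather than by the script itself. A minimal sketch of a fail-fast check using only the standard library; `require_env` is a hypothetical helper and the dummy value exists only for the demo:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Demo with a dummy value; real keys would come from the .env file.
os.environ["PINECONE_KEY"] = "dummy-key-for-demo"
print(require_env("PINECONE_KEY"))
```

In the script, such a helper could be called right after `load_dotenv()` in place of the bare `os.getenv` lookups.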