diff --git a/fern/pages/changelog/2024-03-24-command-r-retrieval-augmented-generation-at-production-scale.mdx b/fern/pages/changelog/2024-03-24-command-r-retrieval-augmented-generation-at-production-scale.mdx index 1ebdecc0..366ae231 100644 --- a/fern/pages/changelog/2024-03-24-command-r-retrieval-augmented-generation-at-production-scale.mdx +++ b/fern/pages/changelog/2024-03-24-command-r-retrieval-augmented-generation-at-production-scale.mdx @@ -17,4 +17,4 @@ Command R is a generative model optimized for long context tasks such as retriev - Strong capabilities across 10 key languages - Model weights available on HuggingFace for research and evaluation -For more information, check out the [official blog post](https://txt.cohere.com/command-r/) or the [Command R documentation](/docs/command-r). +For more information, check out the [official blog post](https://cohere.com/blog/command-r/) or the [Command R documentation](/docs/command-r). diff --git a/fern/pages/cookbooks/creating-a-qa-bot.mdx b/fern/pages/cookbooks/creating-a-qa-bot.mdx index 64cc0dba..d75dad4d 100644 --- a/fern/pages/cookbooks/creating-a-qa-bot.mdx +++ b/fern/pages/cookbooks/creating-a-qa-bot.mdx @@ -123,7 +123,7 @@ The vector database we built using `VectorStoreIndex` comes with an in-built ret retriever = index.as_retriever(similarity_top_k=top_k) ``` -We recently released [Rerank-3](https://txt.cohere.com/rerank-3/) (April '24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with `rerank`, we create a thin wrapper around `index.as_retriever` as follows: +We recently released [Rerank-3](https://cohere.com/blog/rerank-3/) (April '24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with `rerank`, we create a thin wrapper around `index.as_retriever` as follows: ```python PYTHON class RetrieverWithRerank: diff --git a/fern/pages/cookbooks/document-parsing-for-enterprises.mdx b/fern/pages/cookbooks/document-parsing-for-enterprises.mdx index 5490ec55..d2c0791b 100644 --- a/fern/pages/cookbooks/document-parsing-for-enterprises.mdx +++ b/fern/pages/cookbooks/document-parsing-for-enterprises.mdx @@ -23,7 +23,7 @@ The bread and butter of natural language processing technology is text. Once we In the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models. -In this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use [Cohere's Command-R model](https://txt.cohere.com/command-r/) in a RAG setting to answer questions and asks about this label, such as "I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of" a given pharmaceutical. +In this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use [Cohere's Command-R model](https://cohere.com/blog/command-r/) in a RAG setting to answer questions and asks about this label, such as "I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of" a given pharmaceutical. 
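The document-parsing page above ends by describing a RAG flow: parsed PDF chunks are passed to Command-R, which answers questions grounded in them. A minimal sketch of that pattern, assuming the v1 Python SDK used throughout these pages and placeholder chunk text:

```python
# Minimal RAG sketch: ground a Command-R answer in parsed document chunks.
# The chunk contents below are placeholders, not real label text.
import cohere

co = cohere.Client("YOUR_API_KEY")

documents = [
    {"title": "label-section-1", "text": "Compound name and indication go here ..."},
    {"title": "label-section-2", "text": "Route of administration and mechanism of action go here ..."},
]

response = co.chat(
    message="I need a succinct summary of the compound name, indication, "
    "route of administration, and mechanism of action.",
    model="command-r",
    documents=documents,  # the model grounds and cites its answer in these chunks
    temperature=0.0,
)
print(response.text)
```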
Document Parsing Result -**_Read the accompanying [blog post here](https://txt.cohere.ai/search-cohere-langchain/)._** +**_Read the accompanying [blog post here](https://cohere.com/blog/search-cohere-langchain/)._** This notebook contains two examples for performing multilingual search using Cohere and Langchain. Langchain is a library that assists the development of applications built on top of large language models (LLMs), such as Cohere's models. diff --git a/fern/pages/cookbooks/wikipedia-semantic-search.mdx b/fern/pages/cookbooks/wikipedia-semantic-search.mdx index 8e7914e0..35bf1edc 100644 --- a/fern/pages/cookbooks/wikipedia-semantic-search.mdx +++ b/fern/pages/cookbooks/wikipedia-semantic-search.mdx @@ -8,7 +8,7 @@ import { CookbookHeader } from "../../components/cookbook-header"; -This notebook contains the starter code to do simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://txt.cohere.ai/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). +This notebook contains the starter code to do simple [semantic search](https://cohere.com/llmu/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://cohere.com/blog/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). Let's now download 1,000 records from the English Wikipedia embeddings archive so we can search it afterwards. @@ -42,7 +42,7 @@ Downloading: 0%| | 0.00/1.29k [00:00 diff --git a/fern/pages/llm-university/intro-building-apps/app-examples.mdx b/fern/pages/llm-university/intro-building-apps/app-examples.mdx index 35fd6edf..28aefebb 100644 --- a/fern/pages/llm-university/intro-building-apps/app-examples.mdx +++ b/fern/pages/llm-university/intro-building-apps/app-examples.mdx @@ -6,7 +6,7 @@ hidden: true createdAt: "Wed May 03 2023 02:07:08 GMT+0000 (Coordinated Universal Time)" updatedAt: "Thu Apr 18 2024 10:45:10 GMT+0000 (Coordinated Universal Time)" --- -### Semantic Search With Cohere and Langchain +### Semantic Search With Cohere and Langchain Use the embed endpoint with Langchain to efficiently build semantic search applications on top of Cohere’s multilingual model. diff --git a/fern/pages/llm-university/intro-large-language-models/semantic-search-temp.mdx b/fern/pages/llm-university/intro-large-language-models/semantic-search-temp.mdx index 775c8364..0fc1294e 100644 --- a/fern/pages/llm-university/intro-large-language-models/semantic-search-temp.mdx +++ b/fern/pages/llm-university/intro-large-language-models/semantic-search-temp.mdx @@ -205,4 +205,4 @@ Follow along, as in Module 3 of this course, you'll be able to build semantic se ### Original Source -This material comes from the post What is Semantic Search? +This material comes from the post What is Semantic Search? 
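The wikipedia-semantic-search notebook above searches precomputed embeddings. A rough sketch of that flow, under two assumptions: the archive stores its vectors in an `emb` column, and queries are embedded with the same `multilingual-22-12` model used to build the archive:

```python
# Sketch: semantic search over 1,000 precomputed Wikipedia embeddings.
import cohere
import numpy as np
from datasets import load_dataset

co = cohere.Client("YOUR_API_KEY")

# Stream the first 1,000 records rather than downloading the full archive
docs = list(
    load_dataset(
        "Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True
    ).take(1000)
)
doc_embeddings = np.array([d["emb"] for d in docs])

query = "Who founded Wikipedia?"
query_embedding = np.array(
    co.embed(texts=[query], model="multilingual-22-12").embeddings[0]
)

# Rank passages by dot-product similarity and show the top 3
scores = doc_embeddings @ query_embedding
for i in np.argsort(-scores)[:3]:
    print(f"{scores[i]:.3f}", docs[i]["title"], "|", docs[i]["text"][:80])
```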
diff --git a/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences-deprecated.mdx b/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences-deprecated.mdx index c8122c02..d93e6a4a 100644 --- a/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences-deprecated.mdx +++ b/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences-deprecated.mdx @@ -165,4 +165,4 @@ In the previous chapter, we learned that sentence embeddings are the bread and b ### Original Source -This material comes from the post What is Similarity Between Sentences? +This material comes from the post What is Similarity Between Sentences? diff --git a/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences.mdx b/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences.mdx index 8e4b24fe..c8706c60 100644 --- a/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences.mdx +++ b/fern/pages/llm-university/intro-large-language-models/similarity-between-words-and-sentences.mdx @@ -213,4 +213,4 @@ In the previous chapter, we learned that sentence embeddings are the bread and b ### Original Source -This material comes from the post What is Similarity Between Sentences? +This material comes from the post What is Similarity Between Sentences? diff --git a/fern/pages/llm-university/intro-large-language-models/text-embeddings.mdx b/fern/pages/llm-university/intro-large-language-models/text-embeddings.mdx index 46026888..53dfe48c 100644 --- a/fern/pages/llm-university/intro-large-language-models/text-embeddings.mdx +++ b/fern/pages/llm-university/intro-large-language-models/text-embeddings.mdx @@ -138,4 +138,4 @@ Sentence embeddings can be extended to language embeddings, in which the numbers ### Original Source -This material comes from the post What Are Word and Sentence Embeddings? +This material comes from the post What Are Word and Sentence Embeddings? diff --git a/fern/pages/llm-university/intro-large-language-models/the-attention-mechanism.mdx b/fern/pages/llm-university/intro-large-language-models/the-attention-mechanism.mdx index c266fb83..0c51b111 100644 --- a/fern/pages/llm-university/intro-large-language-models/the-attention-mechanism.mdx +++ b/fern/pages/llm-university/intro-large-language-models/the-attention-mechanism.mdx @@ -160,4 +160,4 @@ In this post, you learned what attention mechanisms are. They are a very useful ### Original Source -This material comes from the post [What is Attention in Language Models?](https://txt.cohere.com/what-is-attention-in-language-models/) +This material comes from the post [What is Attention in Language Models?](https://cohere.com/llmu/what-is-attention-in-language-models/) diff --git a/fern/pages/llm-university/intro-large-language-models/transformer-models.mdx b/fern/pages/llm-university/intro-large-language-models/transformer-models.mdx index a8655c90..8a23bf02 100644 --- a/fern/pages/llm-university/intro-large-language-models/transformer-models.mdx +++ b/fern/pages/llm-university/intro-large-language-models/transformer-models.mdx @@ -195,4 +195,4 @@ The repetition of these steps is what writes the amazing text you’ve seen tran ### Original Source -This material comes from the post What Are Transformer Models and How Do They Work? +This material comes from the post What Are Transformer Models and How Do They Work? 
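The attention and transformer chapters above describe the attention mechanism in prose. For readers who want the core computation, here is scaled dot-product attention in a few lines of NumPy; this is illustrative only, and real transformer layers add learned projections, multiple heads, and masking on top of it:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays; returns one blended vector per position."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V  # weighted mix of value vectors

tokens = np.random.randn(4, 8)  # 4 token embeddings of dimension 8
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # -> (4, 8)
```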
diff --git a/fern/pages/llm-university/intro-nlp/how-to-evaluate-a-classifier.mdx b/fern/pages/llm-university/intro-nlp/how-to-evaluate-a-classifier.mdx index ab84e79c..e86d2a19 100644 --- a/fern/pages/llm-university/intro-nlp/how-to-evaluate-a-classifier.mdx +++ b/fern/pages/llm-university/intro-nlp/how-to-evaluate-a-classifier.mdx @@ -74,4 +74,4 @@ In this chapter, you learned the basic four metrics to evaluate classification m ### Original Source -This material comes from the posts Text Classification Intuition for Software Developers and Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained. +This material comes from the posts Text Classification Intuition for Software Developers and Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained. diff --git a/fern/pages/llm-university/intro-prompt-engineering/chaining-prompts-2.mdx b/fern/pages/llm-university/intro-prompt-engineering/chaining-prompts-2.mdx index ded02645..7a0fc2ff 100644 --- a/fern/pages/llm-university/intro-prompt-engineering/chaining-prompts-2.mdx +++ b/fern/pages/llm-university/intro-prompt-engineering/chaining-prompts-2.mdx @@ -70,7 +70,7 @@ The solution is, instead of prompting the same question to the model once, we ca Let’s look at an example taken from a paper by [Wang et al., 2023](https://arxiv.org/abs/2203.11171?ref=txt.cohere.com) that introduces the concept of _self -consistency_. -First, revisiting [the previous chapter](https://txt.cohere.com/constructing-prompts/#chain-of-thought), we looked at the concept of _chain-of-thought prompting_ introduced by [Wei et. al, 2023](https://arxiv.org/abs/2201.11903?ref=txt.cohere.com), where a model is prompted in such a way that it is encouraged to do a reasoning step before giving the final response. In those settings, however, the model is typically encouraged to do “greedy decoding,” which means biasing towards the correct and safe path. This can be done by adjusting settings like the temperature value. +First, revisiting [the previous chapter](https://cohere.com/llmu/constructing-prompts/#chain-of-thought), we looked at the concept of _chain-of-thought prompting_ introduced by [Wei et. al, 2023](https://arxiv.org/abs/2201.11903?ref=txt.cohere.com), where a model is prompted in such a way that it is encouraged to do a reasoning step before giving the final response. In those settings, however, the model is typically encouraged to do “greedy decoding,” which means biasing towards the correct and safe path. This can be done by adjusting settings like the temperature value. With self-consistency, we can build on the chain-of-thought approach by sampling from several paths instead of one. We also make the paths much more diverse by adjusting the settings towards being more “creative,” again using settings like temperature. We then do a majority vote out of all answers. @@ -157,4 +157,4 @@ This is a fascinating area of prompt engineering because it opens up so much roo ### Original Source -This material comes from the post: [Chaining Prompts for the Command Model](https://txt.cohere.com/chaining-prompts/). +This material comes from the post: [Chaining Prompts for the Command Model](https://cohere.com/llmu/chaining-prompts/). 
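The self-consistency passage above reduces to a short loop: sample several diverse reasoning paths at a higher temperature, then take a majority vote over the final answers. A sketch, where `extract_answer` is a hypothetical helper that would need to match the model's actual reasoning format:

```python
from collections import Counter

import cohere

co = cohere.Client("YOUR_API_KEY")

def extract_answer(text: str) -> str:
    # Hypothetical parser: treat the last line of the completion as the answer
    return text.strip().splitlines()[-1]

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        # Higher temperature makes the sampled reasoning paths more diverse
        response = co.chat(message=prompt, model="command-r", temperature=0.9)
        answers.append(extract_answer(response.text))
    # Majority vote across the sampled final answers
    return Counter(answers).most_common(1)[0][0]
```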
diff --git a/fern/pages/llm-university/intro-prompt-engineering/constructing-prompts.mdx b/fern/pages/llm-university/intro-prompt-engineering/constructing-prompts.mdx index 3afc842e..ff8dbbc0 100644 --- a/fern/pages/llm-university/intro-prompt-engineering/constructing-prompts.mdx +++ b/fern/pages/llm-university/intro-prompt-engineering/constructing-prompts.mdx @@ -141,7 +141,7 @@ While LLMs excel in text generation tasks, they struggle in context-aware scenar In real applications, being able to add context to a prompt is key because this is what enables personalized generative AI for a team or company. It makes many use cases possible, such as intelligent assistants, customer support, and productivity tools, that retrieve the right information from a wide range of sources and add it to the prompt. -This is a whole topic on its own, but to provide some idea, [this demo](https://txt.cohere.com/search-cohere-langchain/#example-2-search-based-question-answering) shows an example of information retrieval in action. In this article though, we’ll assume that the right information is already retrieved and added to the prompt. +This is a whole topic on its own, but to provide some idea, [this demo](https://cohere.com/blog/search-cohere-langchain/#example-2-search-based-question-answering) shows an example of information retrieval in action. In this article though, we’ll assume that the right information is already retrieved and added to the prompt. Here’s an example where we ask the model to list the features of the CO-1T wireless headphone without any additional context: diff --git a/fern/pages/llm-university/intro-prompt-engineering/evaluating-outputs.mdx b/fern/pages/llm-university/intro-prompt-engineering/evaluating-outputs.mdx index 358d9a8d..b2722981 100644 --- a/fern/pages/llm-university/intro-prompt-engineering/evaluating-outputs.mdx +++ b/fern/pages/llm-university/intro-prompt-engineering/evaluating-outputs.mdx @@ -209,5 +209,5 @@ Ultimately, each evaluation approach has its potential pitfalls. An evaluation o ### Original Source -This material comes from the post: [Evaluating LLM Outputs](https://txt.cohere.com/evaluating-llm-outputs/). +This material comes from the post: [Evaluating LLM Outputs](https://cohere.com/llmu/evaluating-llm-outputs/). diff --git a/fern/pages/llm-university/intro-semantic-search/dense-retrieval.mdx b/fern/pages/llm-university/intro-semantic-search/dense-retrieval.mdx index 222aa76a..e1420a7b 100644 --- a/fern/pages/llm-university/intro-semantic-search/dense-retrieval.mdx +++ b/fern/pages/llm-university/intro-semantic-search/dense-retrieval.mdx @@ -99,7 +99,7 @@ As you can see, dense retrieval did much better than keyword search here. The se ### Searching in Other Languages -As you may have noticed, the `dense_retrieval` function has a parameter called `results_lang` (see [code lab](https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/End_To_End_Wikipedia_Search.ipynb#scrollTo=kbeNQtzAMagI&line=2&uniqifier=1)). This parameter determines the language in which the search results are outputted. It is defaulted to English ('en') , but for this demo, it can also be set to German ('de'), French ('fr'), Spanish ('es'), Italian ('it'), Japanese ('ja'), Arabic ('ar'), (Simplified) Chinese ('zh'), Korean ('ko'), and Hindi ('hi'). However, the Cohere multilingual embedding handles [over 100 languages](https://txt.cohere.com/multilingual/). 
+As you may have noticed, the `dense_retrieval` function has a parameter called `results_lang` (see [code lab](https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/End_To_End_Wikipedia_Search.ipynb#scrollTo=kbeNQtzAMagI&line=2&uniqifier=1)). This parameter determines the language in which the search results are outputted. It is defaulted to English ('en') , but for this demo, it can also be set to German ('de'), French ('fr'), Spanish ('es'), Italian ('it'), Japanese ('ja'), Arabic ('ar'), (Simplified) Chinese ('zh'), Korean ('ko'), and Hindi ('hi'). However, the Cohere multilingual embedding handles [over 100 languages](https://cohere.com/blog/multilingual/). For the first example, let's search for results to the English query "Who was the first person to win two Nobel prizes" in Arabic. The line of code is the following: diff --git a/fern/pages/llm-university/intro-semantic-search/multilingual-semantic-search-with-cohere-and-langchain.mdx b/fern/pages/llm-university/intro-semantic-search/multilingual-semantic-search-with-cohere-and-langchain.mdx index 20ce03f9..4df03cd1 100644 --- a/fern/pages/llm-university/intro-semantic-search/multilingual-semantic-search-with-cohere-and-langchain.mdx +++ b/fern/pages/llm-university/intro-semantic-search/multilingual-semantic-search-with-cohere-and-langchain.mdx @@ -1,7 +1,7 @@ --- title: "Multilingual Semantic Search With Cohere and Langchain" slug: "docs/multilingual-semantic-search-with-cohere-and-langchain" -subtitle: "From: https://txt.cohere.com/search-cohere-langchain/" +subtitle: "From: https://cohere.com/blog/search-cohere-langchain/" hidden: true createdAt: "Fri Apr 28 2023 19:10:35 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" @@ -18,7 +18,7 @@ The summary above was generated by Cohere’s new Summarize Beta endpoint, with The Cohere Platform provides an API for developers and organizations to access cutting-edge LLMs without needing machine learning know-how. The platform handles all the complexities of curating massive amounts of text data, model development, distributed training, model serving, and more. This means that developers can focus on creating value on the applied side rather than spending time and effort on the capability-building side. -There are two key types of language processing capabilities that the Cohere Platform provides — [text generation](https://txt.cohere.com/search-cohere-langchain/#:~:text=Cohere%20Platform%20provides%20%E2%80%94-,text%20generation,-and%20text%20embedding) and [text embedding](/reference/embed?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682611655483.1682709019083.16&__hssc=14363112.1.1682709019083&__hsfp=2014138109) — and each is served by a different type of model. +There are two key types of language processing capabilities that the Cohere Platform provides — [text generation](https://cohere.com/blog/search-cohere-langchain/#:~:text=Cohere%20Platform%20provides%20%E2%80%94-,text%20generation,-and%20text%20embedding) and [text embedding](/reference/embed?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682611655483.1682709019083.16&__hssc=14363112.1.1682709019083&__hsfp=2014138109) — and each is served by a different type of model. With text generation, we enter a piece of text, or prompt, and get back a stream of text as a completion to the prompt. 
One example is asking the model to write a haiku (the prompt) and getting an originally written haiku in return (the completion). diff --git a/fern/pages/llm-university/intro-semantic-search/what-is-semantic-search.mdx b/fern/pages/llm-university/intro-semantic-search/what-is-semantic-search.mdx index 6d3ab41b..f3c3d896 100644 --- a/fern/pages/llm-university/intro-semantic-search/what-is-semantic-search.mdx +++ b/fern/pages/llm-university/intro-semantic-search/what-is-semantic-search.mdx @@ -211,4 +211,4 @@ Follow along, in the upcoming chapters, you'll be able to build semantic search ### Original Source -This material comes from the post What is Semantic Search? +This material comes from the post What is Semantic Search? diff --git a/fern/pages/llm-university/intro-semantic-search/wikipedia-embeddings.mdx b/fern/pages/llm-university/intro-semantic-search/wikipedia-embeddings.mdx index 4b2d5978..2d9ec979 100644 --- a/fern/pages/llm-university/intro-semantic-search/wikipedia-embeddings.mdx +++ b/fern/pages/llm-university/intro-semantic-search/wikipedia-embeddings.mdx @@ -6,4 +6,4 @@ hidden: true createdAt: "Tue May 23 2023 18:53:20 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" --- -[https://txt.cohere.com/embedding-archives-wikipedia/](https://txt.cohere.com/embedding-archives-wikipedia/) +[https://cohere.com/blog/embedding-archives-wikipedia/](https://cohere.com/blog/embedding-archives-wikipedia/) diff --git a/fern/pages/llm-university/intro-text-generation/chaining-prompts.mdx b/fern/pages/llm-university/intro-text-generation/chaining-prompts.mdx index 88129411..8a7a1ccc 100644 --- a/fern/pages/llm-university/intro-text-generation/chaining-prompts.mdx +++ b/fern/pages/llm-university/intro-text-generation/chaining-prompts.mdx @@ -188,7 +188,7 @@ In this story generation example, we have seen how to chain multiple prompts tog This demonstration has also shown how some of the topics we have discussed in this generative AI series can come together, to name a few: designing prompts, modifying model parameters, getting multiple generations, using likelihood to rank outputs, and more. I hope it has helped to illustrate how to apply these concepts in practical applications. -In this chapter, we used an example of chaining Generate calls since text generation is our focus for the module. But depending on the use case, chances are you might also need to throw other endpoints into the mix, such as Embed and Classify. You can see some examples of chaining multiple endpoints in this blog post. +In this chapter, we used an example of chaining Generate calls since text generation is our focus for the module. But depending on the use case, chances are you might also need to throw other endpoints into the mix, such as Embed and Classify. You can see some examples of chaining multiple endpoints in this blog post. 
### Original Source diff --git a/fern/pages/llm-university/intro-text-generation/creating-custom-models.mdx b/fern/pages/llm-university/intro-text-generation/creating-custom-models.mdx index 3ae861fd..271799ed 100644 --- a/fern/pages/llm-university/intro-text-generation/creating-custom-models.mdx +++ b/fern/pages/llm-university/intro-text-generation/creating-custom-models.mdx @@ -32,7 +32,7 @@ A generative model is already trained on a huge volume of data, making it great - **Specific styles:** Generating text with a certain style or voice, e.g., when generating product descriptions that represent your company’s brand - **Specific formats:** Parsing information from a unique format or structure, e.g., when extracting information from specific types of invoices, resumes, or contracts - **Specific domains:** Dealing with text in highly specialized domains such as medical, scientific, or legal, e.g., when summarizing text dense with technical information -- **Specific knowledge:** Generating text that closely follows a certain theme, e.g., when generating playing cards that are playable, like what we did with Magic the Gathering +- **Specific knowledge:** Generating text that closely follows a certain theme, e.g., when generating playing cards that are playable, like what we did with Magic the Gathering In these cases, with enough examples in the prompt, you might still be able to make the generation work. But there is an element of unpredictability — something you want to eliminate when looking to deploy your application beyond a basic demo. diff --git a/fern/pages/llm-university/intro-text-representation/classification-models.mdx b/fern/pages/llm-university/intro-text-representation/classification-models.mdx index ccc75de1..89223ebb 100644 --- a/fern/pages/llm-university/intro-text-representation/classification-models.mdx +++ b/fern/pages/llm-university/intro-text-representation/classification-models.mdx @@ -44,7 +44,7 @@ Chatbots tend to pair intent classifiers with entity extractors – another lang #### Content Moderation -A significant portion of human interaction now happens online through social media, online forums, and group chats (like Discord or Slack). More often than not, these online communities need moderation to keep their communities safe from different types of online harm. Language understanding systems can empower moderation teams in combating toxic, abusive, and hateful language. +A significant portion of human interaction now happens online through social media, online forums, and group chats (like Discord or Slack). More often than not, these online communities need moderation to keep their communities safe from different types of online harm. Language understanding systems can empower moderation teams in combating toxic, abusive, and hateful language. A content filter can classify texts as either neutral or toxic: @@ -62,8 +62,8 @@ If you're able to map real-world problems to text classification problems, that' In this chapter you learned what a classification model is, and how they can be used for numerous applications. -https\://txt.cohere.com/text-classification-use-cases/ +https\://cohere.com/blog/text-classification-use-cases/ ### Original Source -This material comes from the posts Text Classification Intuition for Software Developers and Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained.
+This material comes from the posts Text Classification Intuition for Software Developers and Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained. diff --git a/fern/pages/llm-university/intro-text-representation/classify-endpoint.mdx b/fern/pages/llm-university/intro-text-representation/classify-endpoint.mdx index 5e29f745..eb9dc698 100644 --- a/fern/pages/llm-university/intro-text-representation/classify-endpoint.mdx +++ b/fern/pages/llm-university/intro-text-representation/classify-endpoint.mdx @@ -166,9 +166,9 @@ As your task gets more complex, you will likely need to bring in additional trai If you’d like to learn more about text classification, here are some additional resources: -- Intuition and use case examples -- An example application: toxicity detection -- Evaluating a classifier’s performance +- Intuition and use case examples +- An example application: toxicity detection +- Evaluating a classifier’s performance - Classify endpoint API reference ### Conclusion @@ -177,4 +177,4 @@ In this chapter, you've learned to classify text based on mood using Cohere's `C ### Original Source -This material comes from the post Hello, World! Meet Language AI: Part 2 +This material comes from the post Hello, World! Meet Language AI: Part 2 diff --git a/fern/pages/llm-university/intro-text-representation/clustering-hacker-news-posts.mdx b/fern/pages/llm-university/intro-text-representation/clustering-hacker-news-posts.mdx index f3b9c080..56415a19 100644 --- a/fern/pages/llm-university/intro-text-representation/clustering-hacker-news-posts.mdx +++ b/fern/pages/llm-university/intro-text-representation/clustering-hacker-news-posts.mdx @@ -215,5 +215,5 @@ In this post, you harnessed the power of embeddings and clustering methods in or ### Original Source -This material comes from the post Combing For Insight in 10,000 Hacker News Posts With Text Clustering +This material comes from the post Combing For Insight in 10,000 Hacker News Posts With Text Clustering diff --git a/fern/pages/llm-university/intro-text-representation/embed-endpoint.mdx b/fern/pages/llm-university/intro-text-representation/embed-endpoint.mdx index 86ee3a43..af4d3f51 100644 --- a/fern/pages/llm-university/intro-text-representation/embed-endpoint.mdx +++ b/fern/pages/llm-university/intro-text-representation/embed-endpoint.mdx @@ -119,5 +119,5 @@ In this chapter you learned about the Embed endpoint. Text embeddings make possi ### Original Source -This material comes from the post Hello, World! Meet Language AI: Part 2 +This material comes from the post Hello, World! Meet Language AI: Part 2 diff --git a/fern/pages/llm-university/intro-text-representation/evaluation-metrics.mdx b/fern/pages/llm-university/intro-text-representation/evaluation-metrics.mdx index 2f902f27..a9a188d6 100644 --- a/fern/pages/llm-university/intro-text-representation/evaluation-metrics.mdx +++ b/fern/pages/llm-university/intro-text-representation/evaluation-metrics.mdx @@ -202,4 +202,4 @@ In this chapter, you took a deep dive into the metrics used for evaluating class ### Original Source -This material comes from the post Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained. +This material comes from the post Classification Evaluation Metrics: Accuracy, Precision, Recall, and F1 Visually Explained. 
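Several of the pages above point to the same four classification metrics: accuracy, precision, recall, and F1. As a quick reference, here they are computed from scratch for a binary classifier:

```python
# The four metrics the evaluation-metrics chapter covers, for a binary task.
def classification_metrics(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: 3 of 5 labels match; precision = recall = f1 = 2/3
print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```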
diff --git a/fern/pages/llm-university/intro-text-representation/text-classification-2.mdx b/fern/pages/llm-university/intro-text-representation/text-classification-2.mdx index 174ac9fe..bd198b93 100644 --- a/fern/pages/llm-university/intro-text-representation/text-classification-2.mdx +++ b/fern/pages/llm-university/intro-text-representation/text-classification-2.mdx @@ -19,7 +19,7 @@ A classification task falls under one of these two categories: **Binary classification**, where the number of classes is two. Here are some examples: - A **spam classifier** could assign emails one one of two classes: "Spam" or "Not spam". -- An online forum could use a **toxicity classifier** to assist with [content moderation](https://txt.cohere.com/cohere-for-content-moderation) by classifying posts as "Neutral" or "Toxic". +- An online forum could use a **toxicity classifier** to assist with [content moderation](https://cohere.com/blog/cohere-for-content-moderation) by classifying posts as "Neutral" or "Toxic". **Multi-class classification**, where the number of classes is more than two. Here are some examples: diff --git a/fern/pages/llm-university/llmu/brief-intro-what-is-nlp-and-llms.mdx b/fern/pages/llm-university/llmu/brief-intro-what-is-nlp-and-llms.mdx index 1cecb37b..67cd22d5 100644 --- a/fern/pages/llm-university/llmu/brief-intro-what-is-nlp-and-llms.mdx +++ b/fern/pages/llm-university/llmu/brief-intro-what-is-nlp-and-llms.mdx @@ -1,7 +1,7 @@ --- title: "Brief intro: What is NLP and LLMs?" slug: "docs/brief-intro-what-is-nlp-and-llms" -subtitle: "From: https://txt.cohere.com/hello-world-p1/" +subtitle: "From: https://cohere.com/blog/hello-world-p1/" hidden: false image: "../../../assets/images/985382a-Cohere_LLM_University.png" diff --git a/fern/pages/llm-university/module-8-chat-and-retrieval-augmented-generation-rag.mdx b/fern/pages/llm-university/module-8-chat-and-retrieval-augmented-generation-rag.mdx index e7601782..ba337427 100644 --- a/fern/pages/llm-university/module-8-chat-and-retrieval-augmented-generation-rag.mdx +++ b/fern/pages/llm-university/module-8-chat-and-retrieval-augmented-generation-rag.mdx @@ -14,8 +14,8 @@ By the end of this module, you will be able to build RAG-powered applications by Here is what you'll learn in this module: -- **[Getting Started With RAG](https://txt.cohere.com/rag-start/)**: Learn the basics of RAG and how to get started with RAG with the Chat endpoint. -- **[RAG With Chat, Embed, and Rerank](https://txt.cohere.com/rag-chatbot/)**: Learn how to build a RAG-powered chatbot using the Chat, Embed, and Rerank endpoints. -- **[RAG With Connectors](https://txt.cohere.com/rag-connectors/)**: Learn about connectors and how to build RAG applications using the web search connector. -- **[RAG With Quickstart Connectors](https://txt.cohere.com/rag-quickstart-connectors/)**: Learn how to connect RAG applications to datastores by leveraging Cohere’s pre-built quickstart connectors. -- **[RAG Over Large-Scale Data](https://txt.cohere.com/rag-large-scale-data/)**: Learn how to build RAG applications over multiple datastores and long documents. +- **[Getting Started With RAG](https://cohere.com/llmu/rag-start/)**: Learn the basics of RAG and how to get started with RAG with the Chat endpoint. +- **[RAG With Chat, Embed, and Rerank](https://cohere.com/llmu/rag-chatbot/)**: Learn how to build a RAG-powered chatbot using the Chat, Embed, and Rerank endpoints. 
+- **[RAG With Connectors](https://cohere.com/llmu/rag-connectors/)**: Learn about connectors and how to build RAG applications using the web search connector. +- **[RAG With Quickstart Connectors](https://cohere.com/llmu/rag-quickstart-connectors/)**: Learn how to connect RAG applications to datastores by leveraging Cohere’s pre-built quickstart connectors. +- **[RAG Over Large-Scale Data](https://cohere.com/llmu/rag-large-scale-data/)**: Learn how to build RAG applications over multiple datastores and long documents. diff --git a/fern/pages/llm-university/sandbox/analyzing-text-using-embeddings-copy-copy.mdx b/fern/pages/llm-university/sandbox/analyzing-text-using-embeddings-copy-copy.mdx index 9e0ad793..8e3f4b98 100644 --- a/fern/pages/llm-university/sandbox/analyzing-text-using-embeddings-copy-copy.mdx +++ b/fern/pages/llm-university/sandbox/analyzing-text-using-embeddings-copy-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE The Embed Endpoint" slug: "docs/analyzing-text-using-embeddings-copy-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Mon May 01 2023 14:37:28 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy-copy.mdx b/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy-copy.mdx index d840dbde..3aadc168 100644 --- a/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy-copy.mdx +++ b/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE: How to Use Cohere's Endpoints" slug: "docs/chapter-1-how-to-use-coheres-endpoints-copy-copy-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. 
From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Thu Apr 27 2023 16:04:06 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" @@ -199,9 +199,9 @@ As your task gets more complex, you will likely need to bring in additional trai If you’d like to learn more about text classification, here are some additional resources: -Intuition and [use case examples](https://txt.cohere.com/text-classification-use-cases/) -An example application: [toxicity detection](https://txt.cohere.com/toxicity-sms/) -Evaluating a [classifier’s performance](https://txt.cohere.com/classification-eval-metrics/) +Intuition and [use case examples](https://cohere.com/blog/text-classification-use-cases/) +An example application: [toxicity detection](https://cohere.com/blog/toxicity-sms/) +Evaluating a [classifier’s performance](https://cohere.com/blog/classification-eval-metrics/) Classify endpoint [API reference](/classify-reference?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682447142806.1682463578843.8&__hssc=14363112.1.1682463578843&__hsfp=2014138109) # Analyzing Text diff --git a/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy.mdx b/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy.mdx index a024264c..3298c31e 100644 --- a/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy.mdx +++ b/fern/pages/llm-university/sandbox/chapter-1-how-to-use-coheres-endpoints-copy-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Semantic Search Text Using Embeddings" slug: "docs/chapter-1-how-to-use-coheres-endpoints-copy-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Thu Apr 27 2023 16:02:03 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/chapter-2-hello-world-meet-language-ai-part-2.mdx b/fern/pages/llm-university/sandbox/chapter-2-hello-world-meet-language-ai-part-2.mdx index f432fc43..3014c7e4 100644 --- a/fern/pages/llm-university/sandbox/chapter-2-hello-world-meet-language-ai-part-2.mdx +++ b/fern/pages/llm-university/sandbox/chapter-2-hello-world-meet-language-ai-part-2.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Three tasks: Classify, Analyze, Generate" slug: "docs/chapter-2-hello-world-meet-language-ai-part-2" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. 
From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Tue Apr 25 2023 16:30:31 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/chapter-4-deploying-with-nextjs.mdx b/fern/pages/llm-university/sandbox/chapter-4-deploying-with-nextjs.mdx index 897e05d9..e0934b8c 100644 --- a/fern/pages/llm-university/sandbox/chapter-4-deploying-with-nextjs.mdx +++ b/fern/pages/llm-university/sandbox/chapter-4-deploying-with-nextjs.mdx @@ -1,7 +1,7 @@ --- title: "DEPRECATE Deploying with Next.js" slug: "docs/chapter-4-deploying-with-nextjs" -subtitle: "From: https://txt.cohere.com/add-nlp-language-ai-to-next-js-app/" +subtitle: "From: https://cohere.com/blog/add-nlp-language-ai-to-next-js-app/" hidden: true createdAt: "Wed Apr 26 2023 20:11:43 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/chapter-5-advanced-nlp-in-google-sheets.mdx b/fern/pages/llm-university/sandbox/chapter-5-advanced-nlp-in-google-sheets.mdx index 0d3e7fb5..f163b7bc 100644 --- a/fern/pages/llm-university/sandbox/chapter-5-advanced-nlp-in-google-sheets.mdx +++ b/fern/pages/llm-university/sandbox/chapter-5-advanced-nlp-in-google-sheets.mdx @@ -189,7 +189,7 @@ The **Generate** endpoint has several hyperparameters, some of which are listed - **Frequency and Presence penalty** — This is used to reduce token repetitiveness by applying penalties. Frequency applies penalties depending on how many times a token appears while Presence applies the penalties equally without bias. - **Return likelihoods** — This, as the name suggests, is used to set if and how token likelihoods should be returned in the response. -You can [learn more about this here](https://txt.cohere.com/llm-parameters-best-outputs-language-ai/). +You can [learn more about this here](https://cohere.com/blog/llm-parameters-best-outputs-language-ai/). You’ll also notice we pass ‘--’ for our stop sequence hyperparameter. If you look carefully at the prompt, you’ll notice we use this sequence between our examples. These let Cohere know when to stop generating. diff --git a/fern/pages/llm-university/sandbox/chapter-5-building-a-classifier-with-the-cohere-api.mdx b/fern/pages/llm-university/sandbox/chapter-5-building-a-classifier-with-the-cohere-api.mdx index 3c1778c7..cfa0ad7f 100644 --- a/fern/pages/llm-university/sandbox/chapter-5-building-a-classifier-with-the-cohere-api.mdx +++ b/fern/pages/llm-university/sandbox/chapter-5-building-a-classifier-with-the-cohere-api.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Three Different Ways to Build a Classifier" slug: "docs/chapter-5-building-a-classifier-with-the-cohere-api" -subtitle: "From this post: https://txt.cohere.com/classify-three-options/" +subtitle: "From this post: https://cohere.com/blog/classify-three-options/" hidden: true createdAt: "Wed Apr 26 2023 00:26:49 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" @@ -126,7 +126,7 @@ Now that the classifier is ready, we can test the 100 data points on it and get y_pred = X_test.apply(classify_text, args=(examples,)).tolist() ``` -We’ll use Accuracy and F1-score to evaluate the classifier against this test dataset (more on how to [evaluate a classifier here](https://txt.cohere.com/classification-eval-metrics/). 
+We’ll use Accuracy and F1-score to evaluate the classifier against this test dataset (more on how to [evaluate a classifier here](https://cohere.com/blog/classification-eval-metrics/). ``` # Compute metrics on the test dataset diff --git a/fern/pages/llm-university/sandbox/chapter-8-building-and-deploying-a-discord-bot.mdx b/fern/pages/llm-university/sandbox/chapter-8-building-and-deploying-a-discord-bot.mdx index bef74fd0..b56387e5 100644 --- a/fern/pages/llm-university/sandbox/chapter-8-building-and-deploying-a-discord-bot.mdx +++ b/fern/pages/llm-university/sandbox/chapter-8-building-and-deploying-a-discord-bot.mdx @@ -138,4 +138,4 @@ This has been a demonstration of a Language AI system that shows some of the pos ### Original Source -This material comes from the post [Building a Search-Based Discord Bot with Language Models](https://txt.cohere.com/building-a-search-based-discord-bot-with-language-models/) by [Nick Frosst](https://txt.cohere.com/author/nicholas/) (a cofounder of Cohere) and [Jay Alammar](https://txt.cohere.com/author/jay/). +This material comes from the post [Building a Search-Based Discord Bot with Language Models](https://cohere.com/blog/building-a-search-based-discord-bot-with-language-models/) by [Nick Frosst](https://cohere.com/blog/authors/nicholas/) (a cofounder of Cohere) and [Jay Alammar](https://cohere.com/blog/authors/jay/). diff --git a/fern/pages/llm-university/sandbox/classification-models-remove.mdx b/fern/pages/llm-university/sandbox/classification-models-remove.mdx index 613cbc6a..99856962 100644 --- a/fern/pages/llm-university/sandbox/classification-models-remove.mdx +++ b/fern/pages/llm-university/sandbox/classification-models-remove.mdx @@ -31,7 +31,7 @@ Chatbots tend to pair intent classifiers with entity extractors – another lang #### Content Moderation -A significant portion of human interaction now happens online through social media, online forums, and group chats (like Discord or Slack). More often than not, these online communities need moderation to keep their communities safe from different types of online harm. Language understanding systems can empower moderation teams in combating toxic, abusive, and hateful language. +A significant portion of human interaction now happens online through social media, online forums, and group chats (like Discord or Slack). More often than not, these online communities need moderation to keep their communities safe from different types of online harm. Language understanding systems can empower moderation teams in combating toxic, abusive, and hateful language. A content filter can classify texts as either neutral or toxic: @@ -49,8 +49,8 @@ If you're able to map real-world problems to text classification problems, that' In this chapter you learned what a classification model is, and how they can be used for numerous applications. 
-https\://txt.cohere.com/text-classification-use-cases/ +https\://cohere.com/blog/text-classification-use-cases/ ### Original Source -This material comes from the post Text Classification Intuition for Software Developers +This material comes from the post Text Classification Intuition for Software Developers diff --git a/fern/pages/llm-university/sandbox/creating-custom-generative-models-copy.mdx b/fern/pages/llm-university/sandbox/creating-custom-generative-models-copy.mdx index 869eabd0..cd206f36 100644 --- a/fern/pages/llm-university/sandbox/creating-custom-generative-models-copy.mdx +++ b/fern/pages/llm-university/sandbox/creating-custom-generative-models-copy.mdx @@ -33,7 +33,7 @@ A generative model is already trained on a huge volume of data, making it great - **Specific styles:** Generating text with a certain style or voice, e.g., when generating product descriptions that represent your company’s brand - **Specific formats:** Parsing information from a unique format or structure, e.g., when extracting information from specific types of invoices, resumes, or contracts - **Specific domains:** Dealing with text in highly specialized domains such as medical, scientific, or legal, e.g., when summarizing text dense with technical information -- **Specific knowledge:** Generating text that closely follows a certain theme, e.g., when generating playing cards that are playable, like what we did with Magic the Gathering +- **Specific knowledge:** Generating text that closely follows a certain theme, e.g., when generating playing cards that are playable, like what we did with Magic the Gathering In these cases, with enough examples in the prompt, you might still be able to make the generation work. But there is an element of unpredictability — something you want to eliminate when looking to deploy your application beyond a basic demo. diff --git a/fern/pages/llm-university/sandbox/remove-analyzing-text-using-embeddings-copy.mdx b/fern/pages/llm-university/sandbox/remove-analyzing-text-using-embeddings-copy.mdx index 3f58a6a8..ea0b6ae0 100644 --- a/fern/pages/llm-university/sandbox/remove-analyzing-text-using-embeddings-copy.mdx +++ b/fern/pages/llm-university/sandbox/remove-analyzing-text-using-embeddings-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Semantic Search Using the Embed Endpoint" slug: "docs/remove-analyzing-text-using-embeddings-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search.
From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Tue May 02 2023 16:00:34 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/remove-hello-world-meet-language-ai-copy.mdx b/fern/pages/llm-university/sandbox/remove-hello-world-meet-language-ai-copy.mdx index 9feba396..c264eb9e 100644 --- a/fern/pages/llm-university/sandbox/remove-hello-world-meet-language-ai-copy.mdx +++ b/fern/pages/llm-university/sandbox/remove-hello-world-meet-language-ai-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Generating Text" slug: "docs/remove-hello-world-meet-language-ai-copy" -subtitle: "From this post: https://txt.cohere.ai/hello-world-p1/" +subtitle: "From this post: https://cohere.com/blog/hello-world-p1/" hidden: true createdAt: "Wed Apr 26 2023 13:35:02 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" @@ -242,5 +242,5 @@ We have taken a quick tour of text generation, but there is so much more to expl - A guide to [prompt engineering](/prompt-engineering-wiki?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682373613982.1682437866580.5&__hssc=14363112.1.1682437866580&__hsfp=2014138109) - Controlling [generation outputs](/token-picking?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682373613982.1682437866580.5&__hssc=14363112.1.1682437866580&__hsfp=2014138109) -- Some [use case ideas](https://txt.cohere.com/llm-use-cases/) with text generation +- Some [use case ideas](https://cohere.com/blog/llm-use-cases/) with text generation - The [Generate API reference](/generate-reference?ref=txt.cohere.com&__hstc=14363112.89f2baed82ac4713854553225677badd.1682345384753.1682373613982.1682437866580.5&__hssc=14363112.1.1682437866580&__hsfp=2014138109) diff --git a/fern/pages/llm-university/sandbox/remove-how-to-use-coheres-endpoints-copy.mdx b/fern/pages/llm-university/sandbox/remove-how-to-use-coheres-endpoints-copy.mdx index 0e8c31b2..17e96a08 100644 --- a/fern/pages/llm-university/sandbox/remove-how-to-use-coheres-endpoints-copy.mdx +++ b/fern/pages/llm-university/sandbox/remove-how-to-use-coheres-endpoints-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE Semantic Search Using Embeddings" slug: "docs/remove-how-to-use-coheres-endpoints-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Wed May 03 2023 14:16:38 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/llm-university/sandbox/the-embed-endpoint-copy.mdx b/fern/pages/llm-university/sandbox/the-embed-endpoint-copy.mdx index 36a844bc..09cd997f 100644 --- a/fern/pages/llm-university/sandbox/the-embed-endpoint-copy.mdx +++ b/fern/pages/llm-university/sandbox/the-embed-endpoint-copy.mdx @@ -1,7 +1,7 @@ --- title: "REMOVE The Embed Endpoint (COPY)" slug: "docs/the-embed-endpoint-copy" -subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. From https://txt.cohere.com/hello-world-p2/" +subtitle: "Intro to the Cohere Endpoints: Classify, Embed, Search. 
From https://cohere.com/blog/hello-world-p2/" hidden: true createdAt: "Wed May 03 2023 14:32:57 GMT+0000 (Coordinated Universal Time)" updatedAt: "Mon Oct 23 2023 14:40:59 GMT+0000 (Coordinated Universal Time)" diff --git a/fern/pages/models/models.mdx b/fern/pages/models/models.mdx index fc2e4c86..b7abc161 100644 --- a/fern/pages/models/models.mdx +++ b/fern/pages/models/models.mdx @@ -29,7 +29,7 @@ At the end of each major sections below, you'll find technical details about how In this section, we'll provide some high-level context on Cohere's offerings, and what the strengths of each are. - The Command family of models includes [Command](https://cohere.com/models/command?_gl=1*15hfaqm*_ga*MTAxNTg1NTM1MS4xNjk1MjMwODQw*_ga_CRGS116RZS*MTcxNzYwMzYxMy4zNTEuMS4xNzE3NjAzNjUxLjIyLjAuMA..), [Command R](/docs/command-r), and [Command R+](/docs/command-r-plus). Together, they are the text-generation LLMs powering conversational agents, summarization, copywriting, and similar use cases. They work through the [Chat](/reference/chat) endpoint, which can be used with or without [retrieval augmented generation](/docs/retrieval-augmented-generation-rag) (RAG). -- [Rerank](https://txt.cohere.com/rerank/?_gl=1*1t6ls4x*_ga*MTAxNTg1NTM1MS4xNjk1MjMwODQw*_ga_CRGS116RZS*MTcxNzYwMzYxMy4zNTEuMS4xNzE3NjAzNjUxLjIyLjAuMA..) is the fastest way to inject the intelligence of a language model into an existing search system. It can be accessed via the [Rerank](/reference/rerank-1) endpoint. +- [Rerank](https://cohere.com/blog/rerank/?_gl=1*1t6ls4x*_ga*MTAxNTg1NTM1MS4xNjk1MjMwODQw*_ga_CRGS116RZS*MTcxNzYwMzYxMy4zNTEuMS4xNzE3NjAzNjUxLjIyLjAuMA..) is the fastest way to inject the intelligence of a language model into an existing search system. It can be accessed via the [Rerank](/reference/rerank-1) endpoint. - [Embed](https://cohere.com/models/embed?_gl=1*1t6ls4x*_ga*MTAxNTg1NTM1MS4xNjk1MjMwODQw*_ga_CRGS116RZS*MTcxNzYwMzYxMy4zNTEuMS4xNzE3NjAzNjUxLjIyLjAuMA..) improves the accuracy of search, classification, clustering, and RAG results. It also powers the [Embed](/reference/embed) and [Classify](/reference/classify) endpoints. ## Command diff --git a/scripts/cookbooks-json/creating-a-qa-bot.json b/scripts/cookbooks-json/creating-a-qa-bot.json index 947448e1..e94f4710 100644 --- a/scripts/cookbooks-json/creating-a-qa-bot.json +++ b/scripts/cookbooks-json/creating-a-qa-bot.json @@ -13,7 +13,7 @@ }, "title": "Creating a QA Bot From Technical Documentation", "slug": "creating-a-qa-bot", - "body": "[block:html]\n{\n \"html\": \"
[stripped cookbook header widget: Back to Cookbooks link, Open in GitHub link, page title: Creating a QA Bot From Technical Documentation]
\\n\\n\"\n}\n[/block]\n\n\nThis notebook demonstrates how to create a chatbot (single turn) that answers user questions based on technical documentation made available to the model.\n\nWe use the `aws-documentation` dataset ([link](https://github.com/siagholami/aws-documentation/tree/main)) for representativeness. This dataset contains 26k+ AWS documentation pages, preprocessed into 120k+ chunks, and 100 questions based on real user questions.\n\nWe proceed as follows:\n1. Embed the AWS documentation into a vector database using Cohere embeddings and `llama_index`\n2. Build a retriever using Cohere's `rerank` for better accuracy, lower inference costs and lower latency\n3. Create model answers for the eval set of 100 questions\n4. Evaluate the model answers against the golden answers of the eval set\n\n\n## Setup\n\n\n```python\n%%capture\n!pip install cohere datasets llama_index llama-index-llms-cohere llama-index-embeddings-cohere\n```\n\n\n```python\nimport cohere\nimport datasets\nfrom llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage\nfrom llama_index.core.schema import TextNode\nfrom llama_index.embeddings.cohere import CohereEmbedding\nimport pandas as pd\n\nimport json\nfrom pathlib import Path\nfrom tqdm import tqdm\nfrom typing import List\n\n```\n\n\n```python\napi_key = \"\" # \nco = cohere.Client(api_key=api_key)\n```\n\n## 1. Embed technical documentation and store as vector database\n\n* Load the dataset from HuggingFace\n* Compute embeddings using Cohere's implementation in LlamaIndex, `CohereEmbedding`\n* Store inside a vector database, `VectorStoreIndex` from LlamaIndex\n\n\nBecause this process is lengthy (~2h for all documents on a MacBookPro), we store the index to disc for future reuse. We also provide a (commented) code snippet to index only a subset of the data. If you use this snippet, bear in mind that many documents will become unavailable to the model and, as a result, performance will suffer!\n\n\n```python\ndata = datasets.load_dataset(\"sauravjoshi23/aws-documentation-chunked\")\nprint(data)\n\nmap_id2index = {sample[\"id\"]: index for index, sample in enumerate(data[\"train\"])}\n\n```\n\n /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n The secret `HF_TOKEN` does not exist in your Colab secrets.\n To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n You will be able to reuse this secret in all of your notebooks.\n Please note that authentication is recommended but still optional to access public models or datasets.\n warnings.warn(\n\n\n DatasetDict({\n train: Dataset({\n features: ['id', 'text', 'source'],\n num_rows: 187147\n })\n })\n\n\n\n```python\n\noverwrite = True # only compute index if it doesn't exist\npath_index = Path(\".\") / \"aws-documentation_index_cohere\"\n\nembed_model = CohereEmbedding(\n cohere_api_key=api_key,\n model_name=\"embed-english-v3.0\",\n)\n\nif not path_index.exists() or overwrite:\n # Documents are prechunked. 
Keep them as-is for now\n stub_len = len(\"https://github.com/siagholami/aws-documentation/tree/main/documents/\")\n documents = [\n # -- for indexing full dataset --\n TextNode(\n text=sample[\"text\"],\n title=sample[\"source\"][stub_len:], # save source minus stub\n id_=sample[\"id\"],\n ) for sample in data[\"train\"]\n # -- for testing on subset --\n # TextNode(\n # text=data[\"train\"][index][\"text\"],\n # title=data[\"train\"][index][\"source\"][stub_len:],\n # id_=data[\"train\"][index][\"id\"],\n # ) for index in range(1_000)\n ]\n index = VectorStoreIndex(documents, embed_model=embed_model)\n index.storage_context.persist(path_index)\n\nelse:\n storage_context = StorageContext.from_defaults(persist_dir=path_index)\n index = load_index_from_storage(storage_context, embed_model=embed_model)\n\n```\n\n## 2. Build a retriever using Cohere's `rerank`\n\nThe vector database we built using `VectorStoreIndex` comes with an in-built retriever. We can call that retriever to fetch the top $k$ documents most relevant to the user question with:\n\n```python\nretriever = index.as_retriever(similarity_top_k=top_k)\n```\n\nWe recently released [Rerank-3](https://txt.cohere.com/rerank-3/) (April '24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with `rerank`, we create a thin wrapper around `index.as_retriever` as follows:\n\n\n```python\nclass RetrieverWithRerank:\n def __init__(self, retriever, api_key):\n self.retriever = retriever\n self.co = cohere.Client(api_key=api_key)\n\n def retrieve(self, query: str, top_n: int):\n # First call to the retriever fetches the closest indices\n nodes = self.retriever.retrieve(query)\n nodes = [\n {\n \"text\": node.node.text,\n \"llamaindex_id\": node.node.id_,\n }\n for node\n in nodes\n ]\n # Call co.rerank to improve the relevance of retrieved documents\n reranked = self.co.rerank(query=query, documents=nodes, model=\"rerank-english-v3.0\", top_n=top_n)\n nodes = [nodes[node.index] for node in reranked.results]\n return nodes\n\n\ntop_k = 60 # how many documents to fetch on first pass\ntop_n = 20 # how many documents to sub-select with rerank\n\nretriever = RetrieverWithRerank(\n index.as_retriever(similarity_top_k=top_k),\n api_key=api_key,\n)\n\n```\n\n\n```python\nquery = \"What happens to my Amazon EC2 instances if I delete my Auto Scaling group?\"\n\ndocuments = retriever.retrieve(query, top_n=top_n)\n\nresp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\nprint(resp.text)\n\n```\n\nThis works! With `co.chat`, you get the additional benefit that citations are returned for every span of text. 
Here's a simple function to display the citations inside square brackets.\n\n\n```python\ndef build_answer_with_citations(response):\n \"\"\" \"\"\"\n text = response.text\n citations = response.citations\n\n # Construct text_with_citations adding citation spans as we iterate through citations\n end = 0\n text_with_citations = \"\"\n\n for citation in citations:\n # Add snippet between last citatiton and current citation\n start = citation.start\n text_with_citations += text[end : start]\n end = citation.end # overwrite\n citation_blocks = \" [\" + \", \".join([stub[4:] for stub in citation.document_ids]) + \"] \"\n text_with_citations += text[start : end] + citation_blocks\n # Add any left-over\n text_with_citations += text[end:]\n\n return text_with_citations\n\ngrounded_answer = build_answer_with_citations(resp)\nprint(grounded_answer)\n\n```\n\n## 3. Create model answers for 100 QA pairs\n\nNow that we have a running pipeline, we need to assess its performance.\n\nThe author of the repository provides 100 QA pairs that we can test the model on. Let's download these questions, then run inference on all 100 questions. Later, we will use Command-R+ -- Cohere's largest and most powerful model -- to measure performance.\n\n\n```python\nurl = \"https://github.com/siagholami/aws-documentation/blob/main/QA_true.csv?raw=true\"\nqa_pairs = pd.read_csv(url)\nqa_pairs.sample(2)\n\n```\n\nWe'll use the fields as follows:\n* `Question`: the user question, passed to `co.chat` to generate the answer\n* `Answer_True`: treat as the ground gruth; compare to the model-generated answer to determine its correctness\n* `Document_True`: treat as the (single) golden document; check the rank of this document inside the model's retrieved documents\n\nWe'll loop over each question and generate our model answer. We'll also complete two steps that will be useful for evaluating our model next:\n1. We compute the rank of the golden document amid the retrieved documents -- this will inform how well our retrieval system performs\n2. We prepare the grading prompts -- these will be sent to an LLM scorer to compute the goodness of responses\n\n\n```python\n\nLLM_EVAL_TEMPLATE = \"\"\"## References\n{references}\n\nQUESTION: based on the above reference documents, answer the following question: {question}\nANSWER: {answer}\nSTUDENT RESPONSE: {completion}\n\nBased on the question and answer above, grade the studen't reponse. A correct response will contain exactly \\\nthe same information as in the answer, even if it is worded differently. If the student's reponse is correct, \\\ngive it a score of 1. Otherwise, give it a score of 0. Let's think step by step. Return your answer as \\\nas a compilable JSON with the following structure:\n{{\n \"reasoning\": ,\n \"score: ,\n}}\"\"\"\n\n\ndef get_rank_of_golden_within_retrieved(golden: str, retrieved: List[dict]) -> int:\n \"\"\"\n Returns the rank that the golden document (single) has within the retrieved documents\n * `golden` contains the source of the document, e.g. 
'amazon-ec2-user-guide/EBSEncryption.md'\n * `retrieved` has a list of responses with key 'llamaindex_id', which links back to document sources\n \"\"\"\n # Create {document: rank} map using llamaindex_id (count first occurrence of any document; they can\n # appear multiple times because they're chunked)\n doc_to_rank = {}\n for rank, doc in enumerate(retrieved):\n # retrieve source of document\n _id = doc[\"llamaindex_id\"]\n source = data[\"train\"][map_id2index[_id]][\"source\"]\n # format as in dataset\n source = source[stub_len:] # remove stub\n source = source.replace(\"/doc_source\", \"\") # remove /doc_source/\n if source not in doc_to_rank:\n doc_to_rank[source] = rank + 1\n\n # Return rank of `golden`, defaulting to len(retrieved) + 1 if it's absent\n return doc_to_rank.get(golden, len(retrieved) + 1)\n\n```\n\n\n```python\nfrom tqdm import tqdm\n\nanswers = []\ngolden_answers = []\nranks = []\ngrading_prompts = [] # best computed in batch\n\nfor _, row in tqdm(qa_pairs.iterrows(), total=len(qa_pairs)):\n query, golden_answer, golden_doc = row[\"Question\"], row[\"Answer_True\"], row[\"Document_True\"]\n golden_answers.append(golden_answer)\n\n # --- Produce answer using retriever ---\n documents = retriever.retrieve(query, top_n=top_n)\n resp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\n answer = resp.text\n answers.append(answer)\n\n # --- Do some prework for evaluation later ---\n # Rank\n rank = get_rank_of_golden_within_retrieved(golden_doc, documents)\n ranks.append(rank)\n # Score: construct the grading prompts for LLM evals, then evaluate in batch\n # Need to reformat documents slightly\n documents = [{\"index\": str(i), \"text\": doc[\"text\"]} for i, doc in enumerate(documents)]\n references_text = \"\\n\\n\".join(\"\\n\".join([f\"{k}: {v}\" for k, v in doc.items()]) for doc in documents)\n # ^ snippet looks complicated, but all it does it unpack all kwargs from `documents`\n # into text separated by \\n\\n\n grading_prompt = LLM_EVAL_TEMPLATE.format(\n references=references_text, question=query, answer=golden_answer, completion=answer,\n )\n grading_prompts.append(grading_prompt)\n\n```\n\n## 4. Evaluate model performance\n\nWe want to test our model performance on two dimensions:\n1. How good is the final answer? We'll compare our model answer to the golden answer using Command-R+ as a judge.\n2. How good is the retrieval? We'll use the rank of the golden document within the retrieved documents to this end.\n\nNote that this pipeline is for illustration only. To measure performance in practice, we would want to run more in-depths tests on a broader, representative dataset.\n\n\n```python\nresults = pd.DataFrame()\nresults[\"answer\"] = answers\nresults[\"golden_answer\"] = qa_pairs[\"Answer_True\"]\nresults[\"rank\"] = ranks\n\n```\n\n### 4.1 Compare answer to golden answer\n\nWe'll use Command-R+ as a judge of whether the answers produced by our model convey the same information as the golden answers. Since we've defined the grading prompts earlier, we can simply ask our LLM judge to evaluate that grading prompt. 
After a little bit of postprocessing, we can then extract our model scores.\n\n\n```python\nscores = []\nreasonings = []\n\ndef remove_backticks(text: str) -> str:\n \"\"\"\n Some models are trained to output JSON in Markdown formatting:\n ```json {json object}```\n Remove the backticks from those model responses so that they become\n parasable by json.loads.\n \"\"\"\n if text.startswith(\"```json\"):\n text = text[7:]\n if text.endswith(\"```\"):\n text = text[:-3]\n return text\n\n\nfor prompt in tqdm(grading_prompts, total=len(grading_prompts)):\n resp = co.chat(message=prompt, model=\"command-r-plus\", temperature=0.)\n # Convert response to JSON to extract the `score` and `reasoning` fields\n # We remove backticks for compatibility with different LLMs\n parsed = json.loads(remove_backticks(resp.text))\n scores.append(parsed[\"score\"])\n reasonings.append(parsed[\"reasoning\"])\n\n```\n\n\n```python\nresults[\"score\"] = scores\nresults[\"reasoning\"] = reasonings\n```\n\n\n```python\nprint(f\"Average score: {results['score'].mean():.3f}\")\n\n```\n\n### 4.2 Compute rank\n\nWe've already computed the rank of the golden documents using `get_rank_of_golden_within_retrieved`. Here, we'll plot the histogram of ranks, using blue when the answer scored a 1, and red when the answer scored a 0.\n\n\n```python\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_theme(style=\"darkgrid\", rc={\"grid.color\": \".8\"})\n\nresults[\"rank_shifted_left\"] = results[\"rank\"] - 0.1\nresults[\"rank_shifted_right\"] = results[\"rank\"] + 0.1\n\nf, ax = plt.subplots(figsize=(5, 3))\nsns.histplot(data=results.loc[results[\"score\"] == 1], x=\"rank_shifted_left\", color=\"skyblue\", label=\"Correct answer\", binwidth=1)\nsns.histplot(data=results.loc[results[\"score\"] == 0], x=\"rank_shifted_right\", color=\"red\", label=\"False answer\", binwidth=1)\n\nax.set_xticks([1, 5, 0, 10, 15, 20])\nax.set_title(\"Rank of golden document (max means golden doc. wasn't retrieved)\")\nax.set_xlabel(\"Rank\")\nax.legend();\n\n```\n\nWe see that retrieval works well overall: for 80% of questions, the golden document is within the top 5 documents. However, we also notice that approx. half the false answers come from instances where the golden document wasn't retrieved (`rank = top_k = 20`). This should be improved, e.g. by adding metadata to the documents such as their section headings, or altering the chunking strategy.\n\nThere is also a non-negligible instance of false answers where the top document was retrieved. On closer inspection, many of these are due to the model phrasing its answers more verbosely than the (very laconic) golden documents. This highlights the importance of checking eval results before jumping to conclusions about model performance.\n\n## Conclusions\n\nIn this notebook, we've built a QA bot that answers user questions based on technical documentation. We've learnt:\n\n1. How to embed the technical documentation into a vector database using Cohere embeddings and `llama_index`\n2. How to build a custom retriever that leverages Cohere's `rerank`\n3. How to evaluate model performance against a predetermined set of golden QA pairs", + "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Creating a QA Bot From Technical Documentation

\\n
\\n\\n\"\n}\n[/block]\n\n\nThis notebook demonstrates how to create a chatbot (single turn) that answers user questions based on technical documentation made available to the model.\n\nWe use the `aws-documentation` dataset ([link](https://github.com/siagholami/aws-documentation/tree/main)) for representativeness. This dataset contains 26k+ AWS documentation pages, preprocessed into 120k+ chunks, and 100 questions based on real user questions.\n\nWe proceed as follows:\n1. Embed the AWS documentation into a vector database using Cohere embeddings and `llama_index`\n2. Build a retriever using Cohere's `rerank` for better accuracy, lower inference costs and lower latency\n3. Create model answers for the eval set of 100 questions\n4. Evaluate the model answers against the golden answers of the eval set\n\n\n## Setup\n\n\n```python\n%%capture\n!pip install cohere datasets llama_index llama-index-llms-cohere llama-index-embeddings-cohere\n```\n\n\n```python\nimport cohere\nimport datasets\nfrom llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage\nfrom llama_index.core.schema import TextNode\nfrom llama_index.embeddings.cohere import CohereEmbedding\nimport pandas as pd\n\nimport json\nfrom pathlib import Path\nfrom tqdm import tqdm\nfrom typing import List\n\n```\n\n\n```python\napi_key = \"\" # \nco = cohere.Client(api_key=api_key)\n```\n\n## 1. Embed technical documentation and store as vector database\n\n* Load the dataset from HuggingFace\n* Compute embeddings using Cohere's implementation in LlamaIndex, `CohereEmbedding`\n* Store inside a vector database, `VectorStoreIndex` from LlamaIndex\n\n\nBecause this process is lengthy (~2h for all documents on a MacBookPro), we store the index to disc for future reuse. We also provide a (commented) code snippet to index only a subset of the data. If you use this snippet, bear in mind that many documents will become unavailable to the model and, as a result, performance will suffer!\n\n\n```python\ndata = datasets.load_dataset(\"sauravjoshi23/aws-documentation-chunked\")\nprint(data)\n\nmap_id2index = {sample[\"id\"]: index for index, sample in enumerate(data[\"train\"])}\n\n```\n\n /usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: \n The secret `HF_TOKEN` does not exist in your Colab secrets.\n To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n You will be able to reuse this secret in all of your notebooks.\n Please note that authentication is recommended but still optional to access public models or datasets.\n warnings.warn(\n\n\n DatasetDict({\n train: Dataset({\n features: ['id', 'text', 'source'],\n num_rows: 187147\n })\n })\n\n\n\n```python\n\noverwrite = True # only compute index if it doesn't exist\npath_index = Path(\".\") / \"aws-documentation_index_cohere\"\n\nembed_model = CohereEmbedding(\n cohere_api_key=api_key,\n model_name=\"embed-english-v3.0\",\n)\n\nif not path_index.exists() or overwrite:\n # Documents are prechunked. 
Keep them as-is for now\n stub_len = len(\"https://github.com/siagholami/aws-documentation/tree/main/documents/\")\n documents = [\n # -- for indexing full dataset --\n TextNode(\n text=sample[\"text\"],\n title=sample[\"source\"][stub_len:], # save source minus stub\n id_=sample[\"id\"],\n ) for sample in data[\"train\"]\n # -- for testing on subset --\n # TextNode(\n # text=data[\"train\"][index][\"text\"],\n # title=data[\"train\"][index][\"source\"][stub_len:],\n # id_=data[\"train\"][index][\"id\"],\n # ) for index in range(1_000)\n ]\n index = VectorStoreIndex(documents, embed_model=embed_model)\n index.storage_context.persist(path_index)\n\nelse:\n storage_context = StorageContext.from_defaults(persist_dir=path_index)\n index = load_index_from_storage(storage_context, embed_model=embed_model)\n\n```\n\n## 2. Build a retriever using Cohere's `rerank`\n\nThe vector database we built using `VectorStoreIndex` comes with an in-built retriever. We can call that retriever to fetch the top $k$ documents most relevant to the user question with:\n\n```python\nretriever = index.as_retriever(similarity_top_k=top_k)\n```\n\nWe recently released [Rerank-3](https://cohere.com/blog/rerank-3/) (April '24), which we can use to improve the quality of retrieval, as well as reduce latency and the cost of inference. To use the retriever with `rerank`, we create a thin wrapper around `index.as_retriever` as follows:\n\n\n```python\nclass RetrieverWithRerank:\n def __init__(self, retriever, api_key):\n self.retriever = retriever\n self.co = cohere.Client(api_key=api_key)\n\n def retrieve(self, query: str, top_n: int):\n # First call to the retriever fetches the closest indices\n nodes = self.retriever.retrieve(query)\n nodes = [\n {\n \"text\": node.node.text,\n \"llamaindex_id\": node.node.id_,\n }\n for node\n in nodes\n ]\n # Call co.rerank to improve the relevance of retrieved documents\n reranked = self.co.rerank(query=query, documents=nodes, model=\"rerank-english-v3.0\", top_n=top_n)\n nodes = [nodes[node.index] for node in reranked.results]\n return nodes\n\n\ntop_k = 60 # how many documents to fetch on first pass\ntop_n = 20 # how many documents to sub-select with rerank\n\nretriever = RetrieverWithRerank(\n index.as_retriever(similarity_top_k=top_k),\n api_key=api_key,\n)\n\n```\n\n\n```python\nquery = \"What happens to my Amazon EC2 instances if I delete my Auto Scaling group?\"\n\ndocuments = retriever.retrieve(query, top_n=top_n)\n\nresp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\nprint(resp.text)\n\n```\n\nThis works! With `co.chat`, you get the additional benefit that citations are returned for every span of text. 
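Before formatting them, it can help to peek at the raw citation objects. The snippet below is a small, optional sanity check that prints the fields the formatting function that follows relies on: each citation carries a `start`/`end` character span into `resp.text`, plus the `document_ids` that support that span.\n\n```python\n# Optional: inspect the first few citation objects returned by co.chat\nfor citation in resp.citations[:3]:\n    print(citation.start, citation.end, citation.document_ids)\n```\n\n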
Here's a simple function to display the citations inside square brackets.\n\n\n```python\ndef build_answer_with_citations(response):\n    \"\"\"Reformat the model answer so each cited span is followed by its source document IDs.\"\"\"\n    text = response.text\n    citations = response.citations\n\n    # Construct text_with_citations adding citation spans as we iterate through citations\n    end = 0\n    text_with_citations = \"\"\n\n    for citation in citations:\n        # Add snippet between last citation and current citation\n        start = citation.start\n        text_with_citations += text[end : start]\n        end = citation.end  # overwrite\n        citation_blocks = \" [\" + \", \".join([stub[4:] for stub in citation.document_ids]) + \"] \"\n        text_with_citations += text[start : end] + citation_blocks\n    # Add any left-over\n    text_with_citations += text[end:]\n\n    return text_with_citations\n\ngrounded_answer = build_answer_with_citations(resp)\nprint(grounded_answer)\n\n```\n\n## 3. Create model answers for 100 QA pairs\n\nNow that we have a running pipeline, we need to assess its performance.\n\nThe author of the repository provides 100 QA pairs that we can test the model on. Let's download these questions, then run inference on all 100 questions. Later, we will use Command-R+ -- Cohere's largest and most powerful model -- to measure performance.\n\n\n```python\nurl = \"https://github.com/siagholami/aws-documentation/blob/main/QA_true.csv?raw=true\"\nqa_pairs = pd.read_csv(url)\nqa_pairs.sample(2)\n\n```\n\nWe'll use the fields as follows:\n* `Question`: the user question, passed to `co.chat` to generate the answer\n* `Answer_True`: treat as the ground truth; compare to the model-generated answer to determine its correctness\n* `Document_True`: treat as the (single) golden document; check the rank of this document inside the model's retrieved documents\n\nWe'll loop over each question and generate our model answer. We'll also complete two steps that will be useful for evaluating our model next:\n1. We compute the rank of the golden document amid the retrieved documents -- this will inform how well our retrieval system performs\n2. We prepare the grading prompts -- these will be sent to an LLM scorer to compute the goodness of responses\n\n\n```python\n\nLLM_EVAL_TEMPLATE = \"\"\"## References\n{references}\n\nQUESTION: based on the above reference documents, answer the following question: {question}\nANSWER: {answer}\nSTUDENT RESPONSE: {completion}\n\nBased on the question and answer above, grade the student's response. A correct response will contain exactly \\\nthe same information as in the answer, even if it is worded differently. If the student's response is correct, \\\ngive it a score of 1. Otherwise, give it a score of 0. Let's think step by step. Return your answer \\\nas a valid JSON with the following structure:\n{{\n    \"reasoning\": <reasoning>,\n    \"score\": <score>\n}}\"\"\"\n\n\ndef get_rank_of_golden_within_retrieved(golden: str, retrieved: List[dict]) -> int:\n    \"\"\"\n    Returns the rank that the golden document (single) has within the retrieved documents\n    * `golden` contains the source of the document, e.g. 'amazon-ec2-user-guide/EBSEncryption.md'\n    * `retrieved` has a list of responses with key 'llamaindex_id', which links back to document sources\n    \"\"\"\n    # Create {document: rank} map using llamaindex_id (count first occurrence of any document; they can\n    # appear multiple times because they're chunked)\n    doc_to_rank = {}\n    for rank, doc in enumerate(retrieved):\n        # retrieve source of document\n        _id = doc[\"llamaindex_id\"]\n        source = data[\"train\"][map_id2index[_id]][\"source\"]\n        # format as in dataset\n        source = source[stub_len:]  # remove stub\n        source = source.replace(\"/doc_source\", \"\")  # remove /doc_source/\n        if source not in doc_to_rank:\n            doc_to_rank[source] = rank + 1\n\n    # Return rank of `golden`, defaulting to len(retrieved) + 1 if it's absent\n    return doc_to_rank.get(golden, len(retrieved) + 1)\n\n```\n\n\n```python\nfrom tqdm import tqdm\n\nanswers = []\ngolden_answers = []\nranks = []\ngrading_prompts = []  # best computed in batch\n\nfor _, row in tqdm(qa_pairs.iterrows(), total=len(qa_pairs)):\n    query, golden_answer, golden_doc = row[\"Question\"], row[\"Answer_True\"], row[\"Document_True\"]\n    golden_answers.append(golden_answer)\n\n    # --- Produce answer using retriever ---\n    documents = retriever.retrieve(query, top_n=top_n)\n    resp = co.chat(message=query, model=\"command-r\", temperature=0., documents=documents)\n    answer = resp.text\n    answers.append(answer)\n\n    # --- Do some prework for evaluation later ---\n    # Rank\n    rank = get_rank_of_golden_within_retrieved(golden_doc, documents)\n    ranks.append(rank)\n    # Score: construct the grading prompts for LLM evals, then evaluate in batch\n    # Need to reformat documents slightly\n    documents = [{\"index\": str(i), \"text\": doc[\"text\"]} for i, doc in enumerate(documents)]\n    references_text = \"\\n\\n\".join(\"\\n\".join([f\"{k}: {v}\" for k, v in doc.items()]) for doc in documents)\n    # ^ snippet looks complicated, but all it does is unpack all kwargs from `documents`\n    # into text separated by \\n\\n\n    grading_prompt = LLM_EVAL_TEMPLATE.format(\n        references=references_text, question=query, answer=golden_answer, completion=answer,\n    )\n    grading_prompts.append(grading_prompt)\n\n```\n\n## 4. Evaluate model performance\n\nWe want to test our model performance on two dimensions:\n1. How good is the final answer? We'll compare our model answer to the golden answer using Command-R+ as a judge.\n2. How good is the retrieval? We'll use the rank of the golden document within the retrieved documents to this end.\n\nNote that this pipeline is for illustration only. To measure performance in practice, we would want to run more in-depth tests on a broader, representative dataset.\n\n\n```python\nresults = pd.DataFrame()\nresults[\"answer\"] = answers\nresults[\"golden_answer\"] = qa_pairs[\"Answer_True\"]\nresults[\"rank\"] = ranks\n\n```\n\n### 4.1 Compare answer to golden answer\n\nWe'll use Command-R+ as a judge of whether the answers produced by our model convey the same information as the golden answers. Since we've defined the grading prompts earlier, we can simply ask our LLM judge to evaluate that grading prompt. After a little bit of postprocessing, we can then extract our model scores.\n\n\n```python\nscores = []\nreasonings = []\n\ndef remove_backticks(text: str) -> str:\n    \"\"\"\n    Some models are trained to output JSON in Markdown formatting:\n    ```json {json object}```\n    Remove the backticks from those model responses so that they become\n    parsable by json.loads.\n    \"\"\"\n    if text.startswith(\"```json\"):\n        text = text[7:]\n    if text.endswith(\"```\"):\n        text = text[:-3]\n    return text\n\n\nfor prompt in tqdm(grading_prompts, total=len(grading_prompts)):\n    resp = co.chat(message=prompt, model=\"command-r-plus\", temperature=0.)\n    # Convert response to JSON to extract the `score` and `reasoning` fields\n    # We remove backticks for compatibility with different LLMs\n    parsed = json.loads(remove_backticks(resp.text))\n    scores.append(parsed[\"score\"])\n    reasonings.append(parsed[\"reasoning\"])\n\n```\n\n\n```python\nresults[\"score\"] = scores\nresults[\"reasoning\"] = reasonings\n```\n\n\n```python\nprint(f\"Average score: {results['score'].mean():.3f}\")\n\n```\n\n### 4.2 Compute rank\n\nWe've already computed the rank of the golden documents using `get_rank_of_golden_within_retrieved`. Here, we'll plot the histogram of ranks, using blue when the answer scored a 1, and red when the answer scored a 0.\n\n\n```python\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_theme(style=\"darkgrid\", rc={\"grid.color\": \".8\"})\n\nresults[\"rank_shifted_left\"] = results[\"rank\"] - 0.1\nresults[\"rank_shifted_right\"] = results[\"rank\"] + 0.1\n\nf, ax = plt.subplots(figsize=(5, 3))\nsns.histplot(data=results.loc[results[\"score\"] == 1], x=\"rank_shifted_left\", color=\"skyblue\", label=\"Correct answer\", binwidth=1)\nsns.histplot(data=results.loc[results[\"score\"] == 0], x=\"rank_shifted_right\", color=\"red\", label=\"False answer\", binwidth=1)\n\nax.set_xticks([1, 5, 10, 15, 20])\nax.set_title(\"Rank of golden document (max means golden doc. wasn't retrieved)\")\nax.set_xlabel(\"Rank\")\nax.legend();\n\n```\n\nWe see that retrieval works well overall: for 80% of questions, the golden document is within the top 5 documents. However, we also notice that approx. half the false answers come from instances where the golden document wasn't retrieved at all (`rank = top_n + 1 = 21`). This should be improved, e.g. by adding metadata to the documents such as their section headings, or altering the chunking strategy.\n\nThere is also a non-negligible number of false answers where the top document was retrieved. On closer inspection, many of these are due to the model phrasing its answers more verbosely than the (very laconic) golden documents. This highlights the importance of checking eval results before jumping to conclusions about model performance.\n\n## Conclusions\n\nIn this notebook, we've built a QA bot that answers user questions based on technical documentation. We've learnt:\n\n1. How to embed the technical documentation into a vector database using Cohere embeddings and `llama_index`\n2. How to build a custom retriever that leverages Cohere's `rerank`\n3. 
How to evaluate model performance against a predetermined set of golden QA pairs", "html": "", "htmlmode": false, "fullscreen": false, diff --git a/scripts/cookbooks-json/document-parsing-for-enterprises.json b/scripts/cookbooks-json/document-parsing-for-enterprises.json index aa79cf66..80d47cdc 100644 --- a/scripts/cookbooks-json/document-parsing-for-enterprises.json +++ b/scripts/cookbooks-json/document-parsing-for-enterprises.json @@ -13,7 +13,7 @@ }, "title": "Advanced Document Parsing For Enterprises", "slug": "document-parsing-for-enterprises", - "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Advanced Document Parsing For Enterprises

\\n
\\n\\n\\n
\\n \\n
\\n \\\"Giannis\\n

Giannis Chatziveroglou

\\n
\\n \\n
\\n \\\"Justin\\n

Justin Lee

\\n
\\n \\n
\\n\\n\"\n}\n[/block]\n\n\n## Introduction\n\nThe bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.\n\nIn the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.\n\nIn this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use [Cohere's Command-R model](https://txt.cohere.com/command-r/) in a RAG setting to answer questions and asks about this label, such as \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of\" a given pharmaceutical.\n\n[block:html]{\"html\":\"\\\"Document\"}[/block]\n\n\n## PDF Parsing\n\nWe will go over five proprietary as well as open source options for processing PDFs. The parsing mechanisms demonstrated in the following sections are\n- [Google Document AI](#gcp)\n- [AWS Textract](#aws)\n- [Unstructured.io](#unstructured)\n- [LlamaParse](#llama)\n- [pdf2image + pytesseract](#pdf2image)\n\nBy way of example, we will be parsing a [21-page PDF](https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf) containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.\n\n[block:html]{\"html\":\"\\\"Drug\"}[/block]\n\n## Getting Set Up\n\nBefore we dive into the technical weeds, we need to set up the notebook's runtime and filesystem environments. The code cells below do the following:\n- Install required libraries\n- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:\n - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)\n - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results to allow readers to skip ahead to the RAG section without having to set up the required infrastructure for each solution.)\n- Add utility functions needed for later sections\n\n\n```python\n%%capture\n! sudo apt install tesseract-ocr poppler-utils\n! 
pip install \"cohere<5\" fsspec hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas\n\n```\n\n\n```python\ndata_dir = \"data/document-parsing\"\nsource_filename = \"example-drug-label\"\nextension = \"pdf\"\n```\n\n\n```python\nfrom pathlib import Path\n\nsources = [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]\n\nfilenames = [\"{}-parsed-fda-approved-drug.txt\".format(source) for source in sources]\nfilenames.append(\"fda-approved-drug.pdf\")\n\nfor filename in filenames: \n file_path = Path(f\"{data_dir}/{filename}\")\n if file_path.is_file() == False:\n print(f\"File {filename} not found at {data_dir}!\")\n```\n\n### Utility Functions\nMake sure to include the notebook's utility functions in the runtime.\n\n\n```python\ndef store_document(path: str, doc_content: str):\n with open(path, 'w') as f:\n f.write(doc_content)\n```\n\n\n```python\nimport json\n\ndef insert_citations_in_order(text, citations, documents):\n \"\"\"\n A helper function to pretty print citations.\n \"\"\"\n\n citations_reference = {}\n for index, doc in enumerate(documents):\n citations_reference[index] = doc\n\n offset = 0\n # Process citations in the order they were provided\n for citation in citations:\n # Adjust start/end with offset\n start, end = citation['start'] + offset, citation['end'] + offset\n citation_numbers = []\n for doc_id in citation[\"document_ids\"]:\n for citation_index, doc in citations_reference.items():\n if doc[\"id\"] == doc_id:\n citation_numbers.append(citation_index)\n references = \"(\" + \", \".join(\"[{}]\".format(num) for num in citation_numbers) + \")\"\n modification = f'{text[start:end]} {references}'\n # Replace the cited text with its bolded version + placeholder\n text = text[:start] + modification + text[end:]\n # Update the offset for subsequent replacements\n offset += len(modification) - (end - start)\n\n # Add the citations at the bottom of the text\n text_with_citations = f'{text}'\n citations_reference = [\"[{}]: {}\".format(x[\"id\"], x[\"text\"]) for x in citations_reference.values()]\n\n return text_with_citations, \"\\n\".join(citations_reference)\n```\n\n\n```python\ndef format_docs_for_chat(documents):\n return [{\"id\": str(index), \"text\": x} for index, x in enumerate(documents)]\n```\n\n## Document Parsing Solutions\n\nFor demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the [next section](#document-questions) to run RAG with Command-R on the pre-fetched versions. 
You can find all parsed resources in detail at the link [here](https://github.com/gchatz22/temp-cohere-resources/tree/main/data).\n\n\n### Solution 1: Google Cloud Document AI [[Back to Solutions]](#top)\n\nDocument AI helps developers create high-accuracy processors to extract, classify, and split documents.\n\nExternal documentation: https://cloud.google.com/document-ai\n\n#### Parsing the document\n\nThe following block can be executed in one of two ways:\n- Inside a Google Vertex AI environment\n - No authentication needed\n- From this notebook\n - Authentication is needed\n - There are pointers inside the code on which lines to uncomment in order to make this work\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\n\"\"\"\nExtracted from https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document\n\"\"\"\n\nimport re\nfrom typing import Optional\n\nfrom google.api_core.client_options import ClientOptions\nfrom google.api_core.exceptions import InternalServerError\nfrom google.api_core.exceptions import RetryError\nfrom google.cloud import documentai # type: ignore\nfrom google.cloud import storage\n\nproject_id = \"\"\nlocation = \"\"\nprocessor_id = \"\"\ngcs_output_uri = \"\"\n# credentials_file = \"populate if you are running in a non Vertex AI environment.\"\ngcs_input_prefix = \"\"\n\n\ndef batch_process_documents(\n project_id: str,\n location: str,\n processor_id: str,\n gcs_output_uri: str,\n gcs_input_prefix: str,\n timeout: int = 400\n) -> None:\n parsed_documents = []\n\n # Client configs\n opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n # With credentials\n # opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\", credentials_file=credentials_file)\n\n client = documentai.DocumentProcessorServiceClient(client_options=opts)\n processor_name = client.processor_path(project_id, location, processor_id)\n\n # Input storage configs\n gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)\n input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)\n\n # Output storage configs\n gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(gcs_uri=gcs_output_uri, field_mask=None)\n output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)\n storage_client = storage.Client()\n # With credentials\n # storage_client = storage.Client.from_service_account_json(json_credentials_path=credentials_file)\n\n # Batch process docs request\n request = documentai.BatchProcessRequest(\n name=processor_name,\n input_documents=input_config,\n document_output_config=output_config,\n )\n\n # batch_process_documents returns a long running operation\n operation = client.batch_process_documents(request)\n\n # Continually polls the operation until it is complete.\n # This could take some time for larger files\n try:\n print(f\"Waiting for operation {operation.operation.name} to complete...\")\n operation.result(timeout=timeout)\n except (RetryError, InternalServerError) as e:\n print(e.message)\n\n # Get output document information from completed operation metadata\n metadata = documentai.BatchProcessMetadata(operation.metadata)\n if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:\n raise ValueError(f\"Batch Process Failed: {metadata.state_message}\")\n\n print(\"Output files:\")\n # One process per Input Document\n for process in list(metadata.individual_process_statuses):\n matches = 
re.match(r\"gs://(.*?)/(.*)\", process.output_gcs_destination)\n if not matches:\n print(\"Could not parse output GCS destination:\", process.output_gcs_destination)\n continue\n\n output_bucket, output_prefix = matches.groups()\n output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)\n\n # Document AI may output multiple JSON files per source file\n # (Large documents get split in multiple file \"versions\" doc --> parsed_doc_0 + parsed_doc_1 ...)\n for blob in output_blobs:\n # Document AI should only output JSON files to GCS\n if blob.content_type != \"application/json\":\n print(f\"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}\")\n continue\n\n # Download JSON file as bytes object and convert to Document Object\n print(f\"Fetching {blob.name}\")\n document = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields=True)\n # Store the filename and the parsed versioned document content as a tuple\n parsed_documents.append((blob.name.split(\"/\")[-1].split(\".\")[0], document.text))\n\n print(\"Finished document parsing process.\")\n return parsed_documents\n\n# Call service\n# versioned_parsed_documents = batch_process_documents(\n# project_id=project_id,\n# location=location,\n# processor_id=processor_id,\n# gcs_output_uri=gcs_output_uri,\n# gcs_input_prefix=gcs_input_prefix\n# )\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\nMake sure to run this in a Google Vertex AI environment or include a credentials file.\n\"\"\"\n\n\"\"\"\nfrom pathlib import Path\nfrom collections import defaultdict\n\nparsed_documents = []\ncombined_versioned_parsed_documents = defaultdict(list)\n\n# Assemble versioned documents together ({\"doc_name\": [(0, doc_content_0), (1, doc_content_1), ...]}).\nfor filename, doc_content in versioned_parsed_documents:\n filename, version = \"-\".join(filename.split(\"-\")[:-1]), filename.split(\"-\")[-1]\n combined_versioned_parsed_documents[filename].append((version, doc_content))\n\n# Sort documents by version and join the content together.\nfor filename, docs in combined_versioned_parsed_documents.items():\n doc_content = \" \".join([x[1] for x in sorted(docs, key=lambda x: x[0])])\n parsed_documents.append((filename, doc_content))\n\n# Store parsed documents in local storage.\nfor filename, doc_content in parsed_documents:\n file_path = \"{}/{}-parsed-{}.txt\".format(data_dir, \"gcp\", source_filename)\n store_document(file_path, doc_content)\n\"\"\"\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"gcp-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n### Solution 2: AWS Textract [[Back to Solutions]](#top)\n\n[Amazon Textract](https://aws.amazon.com/textract/) is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract's asynchronous API.\n\n#### Parsing the document\n\nWe assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. 
Much of the code here is from supplemental materials created by AWS and offered here:\n\n- https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n- https://github.com/aws-samples/textract-paragraph-identification/tree/main\n\nAt minimum, you will need access to the following AWS resources to get started:\n\n- Textract\n- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file\n- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.\n- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic\n\nFirst, we bring in the `TextractWrapper` class provided in the [AWS Code Examples repository](https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/textract/textract_wrapper.py). This class makes it simpler to interface with the Textract service.\n\n\n```python\n# source: https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n\n# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n# SPDX-License-Identifier: Apache-2.0\n\n\"\"\"\nPurpose\n\nShows how to use the AWS SDK for Python (Boto3) with Amazon Textract to\ndetect text, form, and table elements in document images.\n\"\"\"\n\nimport json\nimport logging\nfrom botocore.exceptions import ClientError\n\nlogger = logging.getLogger(__name__)\n\n\n# snippet-start:[python.example_code.textract.TextractWrapper]\nclass TextractWrapper:\n \"\"\"Encapsulates Textract functions.\"\"\"\n\n def __init__(self, textract_client, s3_resource, sqs_resource):\n \"\"\"\n :param textract_client: A Boto3 Textract client.\n :param s3_resource: A Boto3 Amazon S3 resource.\n :param sqs_resource: A Boto3 Amazon SQS resource.\n \"\"\"\n self.textract_client = textract_client\n self.s3_resource = s3_resource\n self.sqs_resource = sqs_resource\n\n # snippet-end:[python.example_code.textract.TextractWrapper]\n\n # snippet-start:[python.example_code.textract.DetectDocumentText]\n def detect_file_text(self, *, document_file_name=None, document_bytes=None):\n \"\"\"\n Detects text elements in a local image file or from in-memory byte data.\n The image must be in PNG or JPG format.\n\n :param document_file_name: The name of a document image file.\n :param document_bytes: In-memory byte data of a document image.\n :return: The response from Amazon Textract, including a list of blocks\n that describe elements detected in the image.\n \"\"\"\n if document_file_name is not None:\n with open(document_file_name, \"rb\") as document_file:\n document_bytes = document_file.read()\n try:\n response = self.textract_client.detect_document_text(\n Document={\"Bytes\": document_bytes}\n )\n logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n except ClientError:\n logger.exception(\"Couldn't detect text.\")\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.DetectDocumentText]\n\n # snippet-start:[python.example_code.textract.AnalyzeDocument]\n def analyze_file(\n self, feature_types, *, document_file_name=None, document_bytes=None\n ):\n \"\"\"\n Detects text and additional elements, such as forms or tables, in a local image\n file or from in-memory byte data.\n The image must be in PNG or JPG format.\n\n :param feature_types: The types of additional document features to detect.\n :param document_file_name: The name of a document image file.\n :param document_bytes: In-memory byte data of a document image.\n :return: The 
response from Amazon Textract, including a list of blocks\n that describe elements detected in the image.\n \"\"\"\n if document_file_name is not None:\n with open(document_file_name, \"rb\") as document_file:\n document_bytes = document_file.read()\n try:\n response = self.textract_client.analyze_document(\n Document={\"Bytes\": document_bytes}, FeatureTypes=feature_types\n )\n logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n except ClientError:\n logger.exception(\"Couldn't detect text.\")\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.AnalyzeDocument]\n\n # snippet-start:[python.example_code.textract.helper.prepare_job]\n def prepare_job(self, bucket_name, document_name, document_bytes):\n \"\"\"\n Prepares a document image for an asynchronous detection job by uploading\n the image bytes to an Amazon S3 bucket. Amazon Textract must have permission\n to read from the bucket to process the image.\n\n :param bucket_name: The name of the Amazon S3 bucket.\n :param document_name: The name of the image stored in Amazon S3.\n :param document_bytes: The image as byte data.\n \"\"\"\n try:\n bucket = self.s3_resource.Bucket(bucket_name)\n bucket.upload_fileobj(document_bytes, document_name)\n logger.info(\"Uploaded %s to %s.\", document_name, bucket_name)\n except ClientError:\n logger.exception(\"Couldn't upload %s to %s.\", document_name, bucket_name)\n raise\n\n # snippet-end:[python.example_code.textract.helper.prepare_job]\n\n # snippet-start:[python.example_code.textract.helper.check_job_queue]\n def check_job_queue(self, queue_url, job_id):\n \"\"\"\n Polls an Amazon SQS queue for messages that indicate a specified Textract\n job has completed.\n\n :param queue_url: The URL of the Amazon SQS queue to poll.\n :param job_id: The ID of the Textract job.\n :return: The status of the job.\n \"\"\"\n status = None\n try:\n queue = self.sqs_resource.Queue(queue_url)\n messages = queue.receive_messages()\n if messages:\n msg_body = json.loads(messages[0].body)\n msg = json.loads(msg_body[\"Message\"])\n if msg.get(\"JobId\") == job_id:\n messages[0].delete()\n status = msg.get(\"Status\")\n logger.info(\n \"Got message %s with status %s.\", messages[0].message_id, status\n )\n else:\n logger.info(\"No messages in queue %s.\", queue_url)\n except ClientError:\n logger.exception(\"Couldn't get messages from queue %s.\", queue_url)\n else:\n return status\n\n # snippet-end:[python.example_code.textract.helper.check_job_queue]\n\n # snippet-start:[python.example_code.textract.StartDocumentTextDetection]\n def start_detection_job(\n self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn\n ):\n \"\"\"\n Starts an asynchronous job to detect text elements in an image stored in an\n Amazon S3 bucket. 
Textract publishes a notification to the specified Amazon SNS\n topic when the job completes.\n The image must be in PNG, JPG, or PDF format.\n\n :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n :param document_file_name: The name of the document image stored in Amazon S3.\n :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n where the job completion notification is published.\n :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n role that can be assumed by Textract and grants permission\n to publish to the Amazon SNS topic.\n :return: The ID of the job.\n \"\"\"\n try:\n response = self.textract_client.start_document_text_detection(\n DocumentLocation={\n \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n },\n NotificationChannel={\n \"SNSTopicArn\": sns_topic_arn,\n \"RoleArn\": sns_role_arn,\n },\n )\n job_id = response[\"JobId\"]\n logger.info(\n \"Started text detection job %s on %s.\", job_id, document_file_name\n )\n except ClientError:\n logger.exception(\"Couldn't detect text in %s.\", document_file_name)\n raise\n else:\n return job_id\n\n # snippet-end:[python.example_code.textract.StartDocumentTextDetection]\n\n # snippet-start:[python.example_code.textract.GetDocumentTextDetection]\n def get_detection_job(self, job_id):\n \"\"\"\n Gets data for a previously started text detection job.\n\n :param job_id: The ID of the job to retrieve.\n :return: The job data, including a list of blocks that describe elements\n detected in the image.\n \"\"\"\n try:\n response = self.textract_client.get_document_text_detection(JobId=job_id)\n job_status = response[\"JobStatus\"]\n logger.info(\"Job %s status is %s.\", job_id, job_status)\n except ClientError:\n logger.exception(\"Couldn't get data for job %s.\", job_id)\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.GetDocumentTextDetection]\n\n # snippet-start:[python.example_code.textract.StartDocumentAnalysis]\n def start_analysis_job(\n self,\n bucket_name,\n document_file_name,\n feature_types,\n sns_topic_arn,\n sns_role_arn,\n ):\n \"\"\"\n Starts an asynchronous job to detect text and additional elements, such as\n forms or tables, in an image stored in an Amazon S3 bucket. 
Textract publishes\n a notification to the specified Amazon SNS topic when the job completes.\n The image must be in PNG, JPG, or PDF format.\n\n :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n :param document_file_name: The name of the document image stored in Amazon S3.\n :param feature_types: The types of additional document features to detect.\n :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n where job completion notification is published.\n :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n role that can be assumed by Textract and grants permission\n to publish to the Amazon SNS topic.\n :return: The ID of the job.\n \"\"\"\n try:\n response = self.textract_client.start_document_analysis(\n DocumentLocation={\n \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n },\n NotificationChannel={\n \"SNSTopicArn\": sns_topic_arn,\n \"RoleArn\": sns_role_arn,\n },\n FeatureTypes=feature_types,\n )\n job_id = response[\"JobId\"]\n logger.info(\n \"Started text analysis job %s on %s.\", job_id, document_file_name\n )\n except ClientError:\n logger.exception(\"Couldn't analyze text in %s.\", document_file_name)\n raise\n else:\n return job_id\n\n # snippet-end:[python.example_code.textract.StartDocumentAnalysis]\n\n # snippet-start:[python.example_code.textract.GetDocumentAnalysis]\n def get_analysis_job(self, job_id):\n \"\"\"\n Gets data for a previously started detection job that includes additional\n elements.\n\n :param job_id: The ID of the job to retrieve.\n :return: The job data, including a list of blocks that describe elements\n detected in the image.\n \"\"\"\n try:\n response = self.textract_client.get_document_analysis(JobId=job_id)\n job_status = response[\"JobStatus\"]\n logger.info(\"Job %s status is %s.\", job_id, job_status)\n except ClientError:\n logger.exception(\"Couldn't get data for job %s.\", job_id)\n raise\n else:\n return response\n\n\n# snippet-end:[python.example_code.textract.GetDocumentAnalysis]\n```\n\nNext, we set up Textract and S3, and provide this to an instance of `TextractWrapper`.\n\n\n```python\nimport boto3\n\ntextract_client = boto3.client('textract')\ns3_client = boto3.client('s3')\n\ntextractWrapper = TextractWrapper(textract_client, s3_client, None)\n```\n\nWe are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported [asynchronously](https://docs.aws.amazon.com/textract/latest/dg/sync.html). So for our purposes here, we will only explore the asynchronous route.\n\nAsynchronous calls follow the below process:\n\n1. Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a Job ID that can be used to track the status of the request\n2. Textract fetches the document from S3 and processes it\n3. Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.\n4. 
The parsed results can be fetched from Textract in chunks via the job ID.\n\n\n```python\nbucket_name = \"your-bucket-name\"\nsns_topic_arn = \"your-sns-arn\" # this can be found under the topic you created in the Amazon SNS dashboard\nsns_role_arn = \"sns-role-arn\" # this is an IAM role that allows Textract to interact with SNS\n\nfile_name = \"example-drug-label.pdf\"\n```\n\n\n```python\n# kick off a text detection job. This returns a job ID.\njob_id = textractWrapper.start_detection_job(bucket_name=bucket_name, document_file_name=file_name,\n sns_topic_arn=sns_topic_arn, sns_role_arn=sns_role_arn)\n```\n\nOnce the job completes, this will return a dictionary with the following keys:\n\n```dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])```\n\nThis response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are `Blocks` and `NextToken`. `Blocks` contains all of the information that was extracted from this chunk, while `NextToken` tells us what chunk comes next, if any.\n\nTextract returns an information-rich representation of the extracted text, such as their position on the page and hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their `Blocks`. Lucky for us, Amazon provides some [helper functions](https://github.com/aws-samples/textract-paragraph-identification/tree/main) for this purpose, which we utilize below.\n\n\n```python\ndef get_text_results_from_textract(job_id):\n response = textract_client.get_document_text_detection(JobId=job_id)\n collection_of_textract_responses = []\n pages = [response]\n\n collection_of_textract_responses.append(response)\n\n while 'NextToken' in response:\n next_token = response['NextToken']\n response = textract_client.get_document_text_detection(JobId=job_id, NextToken=next_token)\n pages.append(response)\n collection_of_textract_responses.append(response)\n return collection_of_textract_responses\n\ndef get_the_text_with_required_info(collection_of_textract_responses):\n total_text = []\n total_text_with_info = []\n running_sequence_number = 0\n\n font_sizes_and_line_numbers = {}\n for page in collection_of_textract_responses:\n per_page_text = []\n blocks = page['Blocks']\n for block in blocks:\n if block['BlockType'] == 'LINE':\n block_text_dict = {}\n running_sequence_number += 1\n block_text_dict.update(text=block['Text'])\n block_text_dict.update(page=block['Page'])\n block_text_dict.update(left_indent=round(block['Geometry']['BoundingBox']['Left'], 2))\n font_height = round(block['Geometry']['BoundingBox']['Height'], 3)\n line_number = running_sequence_number\n block_text_dict.update(font_height=round(block['Geometry']['BoundingBox']['Height'], 3))\n block_text_dict.update(indent_from_top=round(block['Geometry']['BoundingBox']['Top'], 2))\n block_text_dict.update(text_width=round(block['Geometry']['BoundingBox']['Width'], 2))\n block_text_dict.update(line_number=running_sequence_number)\n\n if font_height in font_sizes_and_line_numbers:\n line_numbers = font_sizes_and_line_numbers[font_height]\n line_numbers.append(line_number)\n font_sizes_and_line_numbers[font_height] = line_numbers\n else:\n line_numbers = []\n line_numbers.append(line_number)\n font_sizes_and_line_numbers[font_height] = 
line_numbers\n\n                    total_text.append(block['Text'])\n                    per_page_text.append(block['Text'])\n                    total_text_with_info.append(block_text_dict)\n\n    return total_text, total_text_with_info, font_sizes_and_line_numbers\n\ndef get_text_with_line_spacing_info(total_text_with_info):\n    i = 1\n    text_info_with_line_spacing_info = []\n    while (i < len(total_text_with_info) - 1):\n        previous_line_info = total_text_with_info[i - 1]\n        current_line_info = total_text_with_info[i]\n        next_line_info = total_text_with_info[i + 1]\n        if current_line_info['page'] == next_line_info['page'] and previous_line_info['page'] == current_line_info['page']:\n            line_spacing_after = round((next_line_info['indent_from_top'] - current_line_info['indent_from_top']), 2)\n            spacing_with_prev = round((current_line_info['indent_from_top'] - previous_line_info['indent_from_top']), 2)\n            current_line_info.update(line_space_before=spacing_with_prev)\n            current_line_info.update(line_space_after=line_spacing_after)\n            text_info_with_line_spacing_info.append(current_line_info)\n        else:\n            text_info_with_line_spacing_info.append(None)\n        i += 1\n    return text_info_with_line_spacing_info\n```\n\nWe feed in the Job ID from before into the function `get_text_results_from_textract` to fetch all of the chunks associated with this job. Then, we pass the resulting list into `get_the_text_with_required_info` and `get_text_with_line_spacing_info` to organize the text into lines.\n\nFinally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.\n\n\n```python\ncollection_of_textract_responses = get_text_results_from_textract(job_id)  # fetch all chunks for the job\ntotal_text, total_text_with_info, font_sizes_and_line_numbers = get_the_text_with_required_info(collection_of_textract_responses)\ntext_info_with_line_spacing = get_text_with_line_spacing_info(total_text_with_info)  # None marks page boundaries\n\nall_text = \"\\n\".join([line[\"text\"] if line else \"\" for line in text_info_with_line_spacing])\n\nwith open(f\"aws-parsed-{source_filename}.txt\", \"w\") as f:\n    f.write(all_text)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"aws-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n### Solution 3: Unstructured.io [[Back to Solutions]](#top)\n\nUnstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.\n\nExternal documentation: https://github.com/Unstructured-IO/unstructured-api\n\n#### Parsing the document\n\nThe guide assumes an endpoint exists that hosts this service. The API is offered in two forms:\n1. [a hosted version](https://unstructured.io/)\n2. 
[an OSS docker image](https://github.com/Unstructured-IO/unstructured-api?tab=readme-ov-file#dizzy-instructions-for-using-the-docker-image)\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\nimport os\nimport requests\n\nUNSTRUCTURED_URL = \"\"  # enter service endpoint\n\nparsed_documents = []\n\ninput_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\nwith open(input_path, 'rb') as file_data:\n    response = requests.post(\n        url=UNSTRUCTURED_URL,\n        files={\"files\": (\"{}.{}\".format(source_filename, extension), file_data)},\n        data={\n            \"output_format\": (None, \"application/json\"),\n            \"strategy\": \"hi_res\",\n            \"pdf_infer_table_structure\": \"true\",\n            \"include_page_breaks\": \"true\"\n        },\n        headers={\"Accept\": \"application/json\"}\n    )\n\nparsed_response = response.json()\n\nparsed_document = \" \".join([parsed_entry[\"text\"] for parsed_entry in parsed_response])\nprint(\"Parsed {}\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, \"unstructured-io\")\nstore_document(file_path, parsed_document)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"unstructured-io-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n\n### Solution 4: LlamaParse [[Back to Solutions]](#top)\n\nLlamaParse is an API created by LlamaIndex to efficiently parse and represent files for retrieval and context augmentation with LlamaIndex frameworks.\n\nExternal documentation: https://github.com/run-llama/llama_parse\n\n#### Parsing the document\n\nThe following block uses the LlamaParse cloud offering. 
You can learn more and fetch a respective API key for the service [here](https://cloud.llamaindex.ai/parse).\n\nParsing documents with LlamaParse offers an option for two output modes both of which we will explore and compare below\n- Text\n- Markdown\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\nimport os\nfrom llama_parse import LlamaParse\n\nimport nest_asyncio # needed to notebook env\nnest_asyncio.apply() # needed to notebook env\n\nllama_index_api_key = \"{API_KEY}\"\ninput_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\n```\n\n\n```python\n# Text mode\ntext_parser = LlamaParse(\n api_key=llama_index_api_key,\n result_type=\"text\"\n)\n\ntext_response = text_parser.load_data(input_path)\ntext_parsed_document = \" \".join([parsed_entry.text for parsed_entry in text_response])\n\nprint(\"Parsed {} to text\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-text-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\nstore_document(file_path, text_parsed_document)\n```\n\n\n```python\n# Markdown mode\nmarkdown_parser = LlamaParse(\n api_key=llama_index_api_key,\n result_type=\"markdown\"\n)\n\nmarkdown_response = markdown_parser.load_data(input_path)\nmarkdown_parsed_document = \" \".join([parsed_entry.text for parsed_entry in markdown_response])\n\nprint(\"Parsed {} to markdown\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-markdown-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\nstore_document(file_path, markdown_parsed_document)\n```\n\n#### Visualize the parsed document\n\n\n```python\n# Text parsing\n\nfilename = \"llamaparse-text-parsed-{}.txt\".format(source_filename)\n\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n \nprint(parsed_document[:1000])\n```\n\n\n```python\n# Markdown parsing\n\nfilename = \"llamaparse-markdown-parsed-fda-approved-drug.txt\"\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n \nprint(parsed_document[:1000])\n```\n\n\n\n### Solution 5: pdf2image + pytesseract [[Back to Solutions]](#top)\n\nThe final parsing method we examine does not rely on cloud services, but rather relies on two libraries: `pdf2image`, and `pytesseract`. `pytesseract` lets you perform OCR locally on images, but not PDF files. So, we first convert our PDF into a set of images via `pdf2image`.\n\n#### Parsing the document\n\n\n```python\nfrom matplotlib import pyplot as plt\nfrom pdf2image import convert_from_path\nimport pytesseract\n```\n\n\n```python\n# pdf2image extracts as a list of PIL.Image objects\npages = convert_from_path(filename)\n```\n\n\n```python\n# we look at the first page as a sanity check:\n\nplt.imshow(pages[0])\nplt.axis('off')\nplt.show()\n```\n\nNow, we can process the image of each page with `pytesseract` and concatenate the results to get our parsed document.\n\n\n```python\nlabel_ocr_pytesseract = \"\".join([pytesseract.image_to_string(page) for page in pages])\n```\n\n\n```python\nprint(label_ocr_pytesseract[:200])\n```\n\n HIGHLIGHTS OF PRESCRIBING INFORMATION\n \n These highlights do not include all the information needed to use\n IWILFIN™ safely and effectively. 
See full prescribing information for\n IWILFIN.\n \n IWILFIN™ (eflor\n\n\n\n```python\nlabel_ocr_pytesseract = \"\".join([pytesseract.image_to_string(page) for page in pages])\n\nwith open(f\"pytesseract-parsed-{source_filename}.txt\", \"w\") as f:\n f.write(label_ocr_pytesseract)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"pytesseract-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n## Document Questions\n\nWe can now ask a set of simple + complex questions and see how each parsing solution performs with Command-R. The questions are\n- **What are the most common adverse reactions of Iwilfin?**\n - Task: Simple information extraction\n- **What is the recommended dosage of IWILFIN on body surface area between 0.5 and 0.75?**\n - Task: Tabular data extraction\n- **I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.**\n - Task: Overall document summary\n\n\n```python\nimport cohere\nco = cohere.Client(api_key=\"{API_KEY}\")\n```\n\n\n```python\n\"\"\"\nDocument Questions\n\"\"\"\nprompt = \"What are the most common adverse reactions of Iwilfin?\"\n# prompt = \"What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\"\n# prompt = \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\"\n\n\"\"\"\nChoose one of the above solutions\n\"\"\"\nsource = \"gcp\"\n# source = \"aws\"\n# source = \"unstructured-io\"\n# source = \"llamaparse-text\"\n# source = \"llamaparse-markdown\"\n# source = \"pytesseract\"\n```\n\n## Data Ingestion\n\n\nIn order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the `hnswlib` library. 
Note that there are many different indexing solutions that are appropriate for specific production use cases.\n\n\n```python\n\"\"\"\nRead parsed document content and chunk data\n\"\"\"\n\nimport os\nfrom langchain_text_splitters import RecursiveCharacterTextSplitter\n\ndocuments = []\n\nwith open(\"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, source), \"r\") as doc:\ndoc_content = doc.read()\n\n\"\"\"\nPersonal notes on chunking\nhttps://medium.com/@ayhamboucher/llm-based-context-splitter-for-large-documents-445d3f02b01b\n\"\"\"\n\n\n# Chunk doc content\ntext_splitter = RecursiveCharacterTextSplitter(\n chunk_size=512,\n chunk_overlap=200,\n length_function=len,\n is_separator_regex=False\n)\n\n# Split the text into chunks with some overlap\nchunks_ = text_splitter.create_documents([doc_content])\ndocuments = [c.page_content for c in chunks_]\n\nprint(\"Source document has been broken down to {} chunks\".format(len(documents)))\n```\n\n\n```python\n\"\"\"\nEmbed document chunks\n\"\"\"\ndocument_embeddings = co.embed(texts=documents, model=\"embed-english-v3.0\", input_type=\"search_document\").embeddings\n```\n\n\n```python\n\"\"\"\nCreate document index and add embedded chunks\n\"\"\"\n\nimport hnswlib\n\nindex = hnswlib.Index(space='ip', dim=1024) # space: inner product\nindex.init_index(max_elements=len(document_embeddings), ef_construction=512, M=64)\nindex.add_items(document_embeddings, list(range(len(document_embeddings))))\nprint(\"Count:\", index.element_count)\n```\n\n Count: 115\n\n\n## Retrieval\n\nIn this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere's reranker to reorder the documents in the most relevant order with regards to our input search query.\n\n\n```python\n\"\"\"\nEmbed search query\nFetch k nearest neighbors\n\"\"\"\n\nquery_emb = co.embed(texts=[prompt], model='embed-english-v3.0', input_type=\"search_query\").embeddings\ndefault_knn = 10\nknn = default_knn if default_knn <= index.element_count else index.element_count\nresult = index.knn_query(query_emb, k=knn)\nneighbors = [(result[0][0][i], result[1][0][i]) for i in range(len(result[0][0]))]\nrelevant_docs = [documents[x[0]] for x in sorted(neighbors, key=lambda x: x[1])]\n```\n\n\n```python\n\"\"\"\nRerank retrieved documents\n\"\"\"\n\nrerank_results = co.rerank(query=prompt, documents=relevant_docs, top_n=3, model='rerank-english-v2.0').results\nreranked_relevant_docs = format_docs_for_chat([x.document[\"text\"] for x in rerank_results])\n```\n\n## Final Step: Call Command-R + RAG!\n\n\n```python\n\"\"\"\nCall the /chat endpoint with command-r\n\"\"\"\n\nresponse = co.chat(\n message=prompt,\n model=\"command-r\",\n documents=reranked_relevant_docs\n)\n\ncited_response, citations_reference = insert_citations_in_order(response.text, response.citations, reranked_relevant_docs)\nprint(cited_response)\nprint(\"\\n\")\nprint(\"References:\")\nprint(citations_reference)\n```\n\n## Head-to-head Comparisons\n\nRun the code cells below to make head to head comparisons of the different parsing techniques across different questions.\n\n\n```python\nimport pandas as pd\nresults = pd.read_csv(\"{}/results-table.csv\".format(data_dir))\n```\n\n\n```python\nquestion = input(\"\"\"\nQuestion 1: What are the most common adverse reactions of Iwilfin?\nQuestion 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\nQuestion 3: I need a succinct summary of the compound name, 
indication, route of administration, and mechanism of action of Iwilfin.\n\nPick which question you want to see (1,2,3): \"\"\")\nreferences = input(\"Do you want to see the references as well? References are long and noisy (y/n): \")\nprint(\"\\n\\n\")\n\nindex = {\"1\": 0, \"2\": 3, \"3\": 6}[question]\n\nfor src in [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]:\n print(\"| {} |\".format(src))\n print(\"\\n\")\n print(results[src][index])\n if references == \"y\":\n print(\"\\n\")\n print(\"References:\")\n print(results[src][index+1])\n print(\"\\n\")\n```\n\n \n Question 1: What are the most common adverse reactions of Iwilfin?\n Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\n Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\n \n Pick which question you want to see (1,2,3): 3\n Do you want to see the references as well? References are long and noisy (y/n): n\n \n \n \n | gcp |\n \n \n Compound Name: eflornithine hydrochloride ([0], [1], [2]) (IWILFIN ([1])™)\n \n Indication: used to reduce the risk of relapse in adult and paediatric patients with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded at least partially to prior multiagent, multimodality therapy. ([1], [3], [4])\n \n Route of Administration: IWILFIN™ tablets ([1], [3], [4]) are taken orally twice daily ([3], [4]), with doses ranging from 192 to 768 mg based on body surface area. ([3], [4])\n \n Mechanism of Action: IWILFIN™ is an ornithine decarboxylase inhibitor. ([0], [2])\n \n \n \n | aws |\n \n \n Compound Name: eflornithine ([0], [1], [2], [3]) (IWILFIN ([0])™)\n \n Indication: used to reduce the risk of relapse ([0], [3]) in adults ([0], [3]) and paediatric patients ([0], [3]) with high-risk neuroblastoma (HRNB) ([0], [3]) who have responded to prior therapies. ([0], [3], [4])\n \n Route of Administration: Oral ([2], [4])\n \n Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([1])\n \n \n | unstructured-io |\n \n \n Compound Name: Iwilfin ([1], [2], [3], [4]) (eflornithine) ([0], [2], [3], [4])\n \n Indication: Iwilfin is indicated to reduce the risk of relapse ([1], [3]) in adult and paediatric patients ([1], [3]) with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded to prior anti-GD2 ([1]) immunotherapy ([1], [4]) and multi-modality therapy. ([1])\n \n Route of Administration: Oral ([0], [3])\n \n Mechanism of Action: Iwilfin is an ornithine decarboxylase inhibitor. ([1], [2], [3], [4])\n \n \n | llamaparse-text |\n \n \n Compound Name: IWILFIN ([2], [3]) (eflornithine) ([3])\n \n Indication: IWILFIN is used to reduce the risk of relapse ([1], [2], [3]) in adult and paediatric patients ([1], [2], [3]) with high-risk neuroblastoma (HRNB) ([1], [2], [3]), who have responded at least partially to certain prior therapies. ([2], [3])\n \n Route of Administration: IWILFIN is administered as a tablet. ([2])\n \n Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. 
([0], [1], [4])\n \n \n | llamaparse-markdown |\n \n \n Compound Name: IWILFIN ([1], [2]) (eflornithine) ([1])\n \n Indication: IWILFIN is indicated to reduce the risk of relapse ([1], [2]) in adult and paediatric patients ([1], [2]) with high-risk neuroblastoma (HRNB) ([1], [2]), who have responded at least partially ([1], [2], [3]) to prior anti-GD2 immunotherapy ([1], [2]) and multiagent, multimodality therapy. ([1], [2], [3])\n \n Route of Administration: Oral ([0], [1], [3], [4])\n \n Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([1])\n \n \n | pytesseract |\n \n \n Compound Name: IWILFIN™ ([0], [2]) (eflornithine) ([0], [2])\n \n Indication: IWILFIN is indicated to reduce the risk of relapse ([0], [2]) in adult and paediatric patients ([0], [2]) with high-risk neuroblastoma (HRNB) ([0], [2]), who have responded positively to prior anti-GD2 immunotherapy and multiagent, multimodality therapy. ([0], [2], [4])\n \n Route of Administration: IWILFIN is administered orally ([0], [1], [3], [4]), in the form of a tablet. ([1])\n \n Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([0])", + "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Advanced Document Parsing For Enterprises

\\n
\\n\\n\\n
\\n \\n
\\n \\\"Giannis\\n

Giannis Chatziveroglou

\\n
\\n \\n
\\n \\\"Justin\\n

Justin Lee

\\n
\\n \\n
\\n\\n\"\n}\n[/block]\n\n\n## Introduction\n\nThe bread and butter of natural language processing technology is text. Once we can reduce a set of data into text, we can do all kinds of things with it: question answering, summarization, classification, sentiment analysis, searching and indexing, and more.\n\nIn the context of enterprise Retrieval Augmented Generation (RAG), the information is often locked in complex file types such as PDFs. These formats are made for sharing information between humans, but not so much with language models.\n\nIn this notebook, we will use a real-world pharmaceutical drug label to test out various performant approaches to parsing PDFs. This will allow us to use [Cohere's Command-R model](https://cohere.com/blog/command-r/) in a RAG setting to answer questions and asks about this label, such as \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of\" a given pharmaceutical.\n\n[block:html]{\"html\":\"\\\"Document\"}[/block]\n\n\n## PDF Parsing\n\nWe will go over five proprietary as well as open source options for processing PDFs. The parsing mechanisms demonstrated in the following sections are\n- [Google Document AI](#gcp)\n- [AWS Textract](#aws)\n- [Unstructured.io](#unstructured)\n- [LlamaParse](#llama)\n- [pdf2image + pytesseract](#pdf2image)\n\nBy way of example, we will be parsing a [21-page PDF](https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf) containing the label for a recent FDA drug approval, the beginning of which is shown below. Then, we will perform a series of basic RAG tasks with our different parsings and evaluate their performance.\n\n[block:html]{\"html\":\"\\\"Drug\"}[/block]\n\n## Getting Set Up\n\nBefore we dive into the technical weeds, we need to set up the notebook's runtime and filesystem environments. The code cells below do the following:\n- Install required libraries\n- Confirm that data dependencies from the GitHub repo have been downloaded. These will be under `data/document-parsing` and contain the following:\n - the PDF document that we will be working with, `fda-approved-drug.pdf` (this can also be found here: https://www.accessdata.fda.gov/drugsatfda_docs/label/2023/215500s000lbl.pdf)\n - precomputed parsed documents for each parsing solution. While the point of this notebook is to illustrate how this is done, we provide the parsed final results to allow readers to skip ahead to the RAG section without having to set up the required infrastructure for each solution.)\n- Add utility functions needed for later sections\n\n\n```python\n%%capture\n! sudo apt install tesseract-ocr poppler-utils\n! 
pip install \"cohere<5\" fsspec hnswlib google-cloud-documentai google-cloud-storage boto3 langchain-text-splitters llama_parse pytesseract pdf2image pandas\n\n```\n\n\n```python\ndata_dir = \"data/document-parsing\"\nsource_filename = \"example-drug-label\"\nextension = \"pdf\"\n```\n\n\n```python\nfrom pathlib import Path\n\nsources = [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]\n\nfilenames = [\"{}-parsed-fda-approved-drug.txt\".format(source) for source in sources]\nfilenames.append(\"fda-approved-drug.pdf\")\n\nfor filename in filenames: \n file_path = Path(f\"{data_dir}/{filename}\")\n if file_path.is_file() == False:\n print(f\"File {filename} not found at {data_dir}!\")\n```\n\n### Utility Functions\nMake sure to include the notebook's utility functions in the runtime.\n\n\n```python\ndef store_document(path: str, doc_content: str):\n with open(path, 'w') as f:\n f.write(doc_content)\n```\n\n\n```python\nimport json\n\ndef insert_citations_in_order(text, citations, documents):\n \"\"\"\n A helper function to pretty print citations.\n \"\"\"\n\n citations_reference = {}\n for index, doc in enumerate(documents):\n citations_reference[index] = doc\n\n offset = 0\n # Process citations in the order they were provided\n for citation in citations:\n # Adjust start/end with offset\n start, end = citation['start'] + offset, citation['end'] + offset\n citation_numbers = []\n for doc_id in citation[\"document_ids\"]:\n for citation_index, doc in citations_reference.items():\n if doc[\"id\"] == doc_id:\n citation_numbers.append(citation_index)\n references = \"(\" + \", \".join(\"[{}]\".format(num) for num in citation_numbers) + \")\"\n modification = f'{text[start:end]} {references}'\n # Replace the cited text with its bolded version + placeholder\n text = text[:start] + modification + text[end:]\n # Update the offset for subsequent replacements\n offset += len(modification) - (end - start)\n\n # Add the citations at the bottom of the text\n text_with_citations = f'{text}'\n citations_reference = [\"[{}]: {}\".format(x[\"id\"], x[\"text\"]) for x in citations_reference.values()]\n\n return text_with_citations, \"\\n\".join(citations_reference)\n```\n\n\n```python\ndef format_docs_for_chat(documents):\n return [{\"id\": str(index), \"text\": x} for index, x in enumerate(documents)]\n```\n\n## Document Parsing Solutions\n\nFor demonstration purposes, we have collected and saved the parsed documents from each solution in this notebook. Skip to the [next section](#document-questions) to run RAG with Command-R on the pre-fetched versions. 
You can find all parsed resources in detail at the link [here](https://github.com/gchatz22/temp-cohere-resources/tree/main/data).\n\n\n### Solution 1: Google Cloud Document AI [[Back to Solutions]](#top)\n\nDocument AI helps developers create high-accuracy processors to extract, classify, and split documents.\n\nExternal documentation: https://cloud.google.com/document-ai\n\n#### Parsing the document\n\nThe following block can be executed in one of two ways:\n- Inside a Google Vertex AI environment\n - No authentication needed\n- From this notebook\n - Authentication is needed\n - There are pointers inside the code on which lines to uncomment in order to make this work\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\n\"\"\"\nExtracted from https://cloud.google.com/document-ai/docs/samples/documentai-batch-process-document\n\"\"\"\n\nimport re\nfrom typing import Optional\n\nfrom google.api_core.client_options import ClientOptions\nfrom google.api_core.exceptions import InternalServerError\nfrom google.api_core.exceptions import RetryError\nfrom google.cloud import documentai # type: ignore\nfrom google.cloud import storage\n\nproject_id = \"\"\nlocation = \"\"\nprocessor_id = \"\"\ngcs_output_uri = \"\"\n# credentials_file = \"populate if you are running in a non Vertex AI environment.\"\ngcs_input_prefix = \"\"\n\n\ndef batch_process_documents(\n project_id: str,\n location: str,\n processor_id: str,\n gcs_output_uri: str,\n gcs_input_prefix: str,\n timeout: int = 400\n) -> None:\n parsed_documents = []\n\n # Client configs\n opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\n # With credentials\n # opts = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\", credentials_file=credentials_file)\n\n client = documentai.DocumentProcessorServiceClient(client_options=opts)\n processor_name = client.processor_path(project_id, location, processor_id)\n\n # Input storage configs\n gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)\n input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)\n\n # Output storage configs\n gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(gcs_uri=gcs_output_uri, field_mask=None)\n output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)\n storage_client = storage.Client()\n # With credentials\n # storage_client = storage.Client.from_service_account_json(json_credentials_path=credentials_file)\n\n # Batch process docs request\n request = documentai.BatchProcessRequest(\n name=processor_name,\n input_documents=input_config,\n document_output_config=output_config,\n )\n\n # batch_process_documents returns a long running operation\n operation = client.batch_process_documents(request)\n\n # Continually polls the operation until it is complete.\n # This could take some time for larger files\n try:\n print(f\"Waiting for operation {operation.operation.name} to complete...\")\n operation.result(timeout=timeout)\n except (RetryError, InternalServerError) as e:\n print(e.message)\n\n # Get output document information from completed operation metadata\n metadata = documentai.BatchProcessMetadata(operation.metadata)\n if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:\n raise ValueError(f\"Batch Process Failed: {metadata.state_message}\")\n\n print(\"Output files:\")\n # One process per Input Document\n for process in list(metadata.individual_process_statuses):\n matches = 
re.match(r\"gs://(.*?)/(.*)\", process.output_gcs_destination)\n if not matches:\n print(\"Could not parse output GCS destination:\", process.output_gcs_destination)\n continue\n\n output_bucket, output_prefix = matches.groups()\n output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)\n\n # Document AI may output multiple JSON files per source file\n # (Large documents get split in multiple file \"versions\" doc --> parsed_doc_0 + parsed_doc_1 ...)\n for blob in output_blobs:\n # Document AI should only output JSON files to GCS\n if blob.content_type != \"application/json\":\n print(f\"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}\")\n continue\n\n # Download JSON file as bytes object and convert to Document Object\n print(f\"Fetching {blob.name}\")\n document = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields=True)\n # Store the filename and the parsed versioned document content as a tuple\n parsed_documents.append((blob.name.split(\"/\")[-1].split(\".\")[0], document.text))\n\n print(\"Finished document parsing process.\")\n return parsed_documents\n\n# Call service\n# versioned_parsed_documents = batch_process_documents(\n# project_id=project_id,\n# location=location,\n# processor_id=processor_id,\n# gcs_output_uri=gcs_output_uri,\n# gcs_input_prefix=gcs_input_prefix\n# )\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\nMake sure to run this in a Google Vertex AI environment or include a credentials file.\n\"\"\"\n\n\"\"\"\nfrom pathlib import Path\nfrom collections import defaultdict\n\nparsed_documents = []\ncombined_versioned_parsed_documents = defaultdict(list)\n\n# Assemble versioned documents together ({\"doc_name\": [(0, doc_content_0), (1, doc_content_1), ...]}).\nfor filename, doc_content in versioned_parsed_documents:\n filename, version = \"-\".join(filename.split(\"-\")[:-1]), filename.split(\"-\")[-1]\n combined_versioned_parsed_documents[filename].append((version, doc_content))\n\n# Sort documents by version and join the content together.\nfor filename, docs in combined_versioned_parsed_documents.items():\n doc_content = \" \".join([x[1] for x in sorted(docs, key=lambda x: x[0])])\n parsed_documents.append((filename, doc_content))\n\n# Store parsed documents in local storage.\nfor filename, doc_content in parsed_documents:\n file_path = \"{}/{}-parsed-{}.txt\".format(data_dir, \"gcp\", source_filename)\n store_document(file_path, doc_content)\n\"\"\"\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"gcp-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n### Solution 2: AWS Textract [[Back to Solutions]](#top)\n\n[Amazon Textract](https://aws.amazon.com/textract/) is an OCR service offered by AWS. It can detect text, forms, tables, and more in PDFs and images. In this section, we go over how to use Textract's asynchronous API.\n\n#### Parsing the document\n\nWe assume that you are working within the AWS ecosystem (from a SageMaker notebook, EC2 instance, a Lambda function, etc.) with valid credentials. 
Much of the code here is from supplemental materials created by AWS and offered here:\n\n- https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n- https://github.com/aws-samples/textract-paragraph-identification/tree/main\n\nAt minimum, you will need access to the following AWS resources to get started:\n\n- Textract\n- an S3 bucket containing the document(s) to process - in this case, our `example-drug-label.pdf` file\n- an SNS topic that Textract can publish to. This is used to send a notification that parsing is complete.\n- an IAM role that Textract will assume, granting access to the S3 bucket and SNS topic\n\nFirst, we bring in the `TextractWrapper` class provided in the [AWS Code Examples repository](https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/textract/textract_wrapper.py). This class makes it simpler to interface with the Textract service.\n\n\n```python\n# source: https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/python/example_code/textract\n\n# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n# SPDX-License-Identifier: Apache-2.0\n\n\"\"\"\nPurpose\n\nShows how to use the AWS SDK for Python (Boto3) with Amazon Textract to\ndetect text, form, and table elements in document images.\n\"\"\"\n\nimport json\nimport logging\nfrom botocore.exceptions import ClientError\n\nlogger = logging.getLogger(__name__)\n\n\n# snippet-start:[python.example_code.textract.TextractWrapper]\nclass TextractWrapper:\n \"\"\"Encapsulates Textract functions.\"\"\"\n\n def __init__(self, textract_client, s3_resource, sqs_resource):\n \"\"\"\n :param textract_client: A Boto3 Textract client.\n :param s3_resource: A Boto3 Amazon S3 resource.\n :param sqs_resource: A Boto3 Amazon SQS resource.\n \"\"\"\n self.textract_client = textract_client\n self.s3_resource = s3_resource\n self.sqs_resource = sqs_resource\n\n # snippet-end:[python.example_code.textract.TextractWrapper]\n\n # snippet-start:[python.example_code.textract.DetectDocumentText]\n def detect_file_text(self, *, document_file_name=None, document_bytes=None):\n \"\"\"\n Detects text elements in a local image file or from in-memory byte data.\n The image must be in PNG or JPG format.\n\n :param document_file_name: The name of a document image file.\n :param document_bytes: In-memory byte data of a document image.\n :return: The response from Amazon Textract, including a list of blocks\n that describe elements detected in the image.\n \"\"\"\n if document_file_name is not None:\n with open(document_file_name, \"rb\") as document_file:\n document_bytes = document_file.read()\n try:\n response = self.textract_client.detect_document_text(\n Document={\"Bytes\": document_bytes}\n )\n logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n except ClientError:\n logger.exception(\"Couldn't detect text.\")\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.DetectDocumentText]\n\n # snippet-start:[python.example_code.textract.AnalyzeDocument]\n def analyze_file(\n self, feature_types, *, document_file_name=None, document_bytes=None\n ):\n \"\"\"\n Detects text and additional elements, such as forms or tables, in a local image\n file or from in-memory byte data.\n The image must be in PNG or JPG format.\n\n :param feature_types: The types of additional document features to detect.\n :param document_file_name: The name of a document image file.\n :param document_bytes: In-memory byte data of a document image.\n :return: The 
response from Amazon Textract, including a list of blocks\n that describe elements detected in the image.\n \"\"\"\n if document_file_name is not None:\n with open(document_file_name, \"rb\") as document_file:\n document_bytes = document_file.read()\n try:\n response = self.textract_client.analyze_document(\n Document={\"Bytes\": document_bytes}, FeatureTypes=feature_types\n )\n logger.info(\"Detected %s blocks.\", len(response[\"Blocks\"]))\n except ClientError:\n logger.exception(\"Couldn't detect text.\")\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.AnalyzeDocument]\n\n # snippet-start:[python.example_code.textract.helper.prepare_job]\n def prepare_job(self, bucket_name, document_name, document_bytes):\n \"\"\"\n Prepares a document image for an asynchronous detection job by uploading\n the image bytes to an Amazon S3 bucket. Amazon Textract must have permission\n to read from the bucket to process the image.\n\n :param bucket_name: The name of the Amazon S3 bucket.\n :param document_name: The name of the image stored in Amazon S3.\n :param document_bytes: The image as byte data.\n \"\"\"\n try:\n bucket = self.s3_resource.Bucket(bucket_name)\n bucket.upload_fileobj(document_bytes, document_name)\n logger.info(\"Uploaded %s to %s.\", document_name, bucket_name)\n except ClientError:\n logger.exception(\"Couldn't upload %s to %s.\", document_name, bucket_name)\n raise\n\n # snippet-end:[python.example_code.textract.helper.prepare_job]\n\n # snippet-start:[python.example_code.textract.helper.check_job_queue]\n def check_job_queue(self, queue_url, job_id):\n \"\"\"\n Polls an Amazon SQS queue for messages that indicate a specified Textract\n job has completed.\n\n :param queue_url: The URL of the Amazon SQS queue to poll.\n :param job_id: The ID of the Textract job.\n :return: The status of the job.\n \"\"\"\n status = None\n try:\n queue = self.sqs_resource.Queue(queue_url)\n messages = queue.receive_messages()\n if messages:\n msg_body = json.loads(messages[0].body)\n msg = json.loads(msg_body[\"Message\"])\n if msg.get(\"JobId\") == job_id:\n messages[0].delete()\n status = msg.get(\"Status\")\n logger.info(\n \"Got message %s with status %s.\", messages[0].message_id, status\n )\n else:\n logger.info(\"No messages in queue %s.\", queue_url)\n except ClientError:\n logger.exception(\"Couldn't get messages from queue %s.\", queue_url)\n else:\n return status\n\n # snippet-end:[python.example_code.textract.helper.check_job_queue]\n\n # snippet-start:[python.example_code.textract.StartDocumentTextDetection]\n def start_detection_job(\n self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn\n ):\n \"\"\"\n Starts an asynchronous job to detect text elements in an image stored in an\n Amazon S3 bucket. 
Textract publishes a notification to the specified Amazon SNS\n topic when the job completes.\n The image must be in PNG, JPG, or PDF format.\n\n :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n :param document_file_name: The name of the document image stored in Amazon S3.\n :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n where the job completion notification is published.\n :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n role that can be assumed by Textract and grants permission\n to publish to the Amazon SNS topic.\n :return: The ID of the job.\n \"\"\"\n try:\n response = self.textract_client.start_document_text_detection(\n DocumentLocation={\n \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n },\n NotificationChannel={\n \"SNSTopicArn\": sns_topic_arn,\n \"RoleArn\": sns_role_arn,\n },\n )\n job_id = response[\"JobId\"]\n logger.info(\n \"Started text detection job %s on %s.\", job_id, document_file_name\n )\n except ClientError:\n logger.exception(\"Couldn't detect text in %s.\", document_file_name)\n raise\n else:\n return job_id\n\n # snippet-end:[python.example_code.textract.StartDocumentTextDetection]\n\n # snippet-start:[python.example_code.textract.GetDocumentTextDetection]\n def get_detection_job(self, job_id):\n \"\"\"\n Gets data for a previously started text detection job.\n\n :param job_id: The ID of the job to retrieve.\n :return: The job data, including a list of blocks that describe elements\n detected in the image.\n \"\"\"\n try:\n response = self.textract_client.get_document_text_detection(JobId=job_id)\n job_status = response[\"JobStatus\"]\n logger.info(\"Job %s status is %s.\", job_id, job_status)\n except ClientError:\n logger.exception(\"Couldn't get data for job %s.\", job_id)\n raise\n else:\n return response\n\n # snippet-end:[python.example_code.textract.GetDocumentTextDetection]\n\n # snippet-start:[python.example_code.textract.StartDocumentAnalysis]\n def start_analysis_job(\n self,\n bucket_name,\n document_file_name,\n feature_types,\n sns_topic_arn,\n sns_role_arn,\n ):\n \"\"\"\n Starts an asynchronous job to detect text and additional elements, such as\n forms or tables, in an image stored in an Amazon S3 bucket. 
Textract publishes\n a notification to the specified Amazon SNS topic when the job completes.\n The image must be in PNG, JPG, or PDF format.\n\n :param bucket_name: The name of the Amazon S3 bucket that contains the image.\n :param document_file_name: The name of the document image stored in Amazon S3.\n :param feature_types: The types of additional document features to detect.\n :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic\n where job completion notification is published.\n :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM)\n role that can be assumed by Textract and grants permission\n to publish to the Amazon SNS topic.\n :return: The ID of the job.\n \"\"\"\n try:\n response = self.textract_client.start_document_analysis(\n DocumentLocation={\n \"S3Object\": {\"Bucket\": bucket_name, \"Name\": document_file_name}\n },\n NotificationChannel={\n \"SNSTopicArn\": sns_topic_arn,\n \"RoleArn\": sns_role_arn,\n },\n FeatureTypes=feature_types,\n )\n job_id = response[\"JobId\"]\n logger.info(\n \"Started text analysis job %s on %s.\", job_id, document_file_name\n )\n except ClientError:\n logger.exception(\"Couldn't analyze text in %s.\", document_file_name)\n raise\n else:\n return job_id\n\n # snippet-end:[python.example_code.textract.StartDocumentAnalysis]\n\n # snippet-start:[python.example_code.textract.GetDocumentAnalysis]\n def get_analysis_job(self, job_id):\n \"\"\"\n Gets data for a previously started detection job that includes additional\n elements.\n\n :param job_id: The ID of the job to retrieve.\n :return: The job data, including a list of blocks that describe elements\n detected in the image.\n \"\"\"\n try:\n response = self.textract_client.get_document_analysis(JobId=job_id)\n job_status = response[\"JobStatus\"]\n logger.info(\"Job %s status is %s.\", job_id, job_status)\n except ClientError:\n logger.exception(\"Couldn't get data for job %s.\", job_id)\n raise\n else:\n return response\n\n\n# snippet-end:[python.example_code.textract.GetDocumentAnalysis]\n```\n\nNext, we set up Textract and S3, and provide this to an instance of `TextractWrapper`.\n\n\n```python\nimport boto3\n\ntextract_client = boto3.client('textract')\ns3_client = boto3.client('s3')\n\ntextractWrapper = TextractWrapper(textract_client, s3_client, None)\n```\n\nWe are now ready to make calls to Textract. At a high level, Textract has two modes: synchronous and asynchronous. Synchronous calls return the parsed output once it is completed. As of the time of writing (March 2024), however, multipage PDF processing is only supported [asynchronously](https://docs.aws.amazon.com/textract/latest/dg/sync.html). So for our purposes here, we will only explore the asynchronous route.\n\nAsynchronous calls follow the below process:\n\n1. Send a request to Textract with an SNS topic, S3 bucket, and the name (key) of the document inside that bucket to process. Textract returns a Job ID that can be used to track the status of the request\n2. Textract fetches the document from S3 and processes it\n3. Once the request is complete, Textract sends out a message to the SNS topic. This can be used in conjunction with other services such as Lambda or SQS for downstream processes.\n4. 
The parsed results can be fetched from Textract in chunks via the job ID.\n\n\n```python\nbucket_name = \"your-bucket-name\"\nsns_topic_arn = \"your-sns-arn\" # this can be found under the topic you created in the Amazon SNS dashboard\nsns_role_arn = \"sns-role-arn\" # this is an IAM role that allows Textract to interact with SNS\n\nfile_name = \"example-drug-label.pdf\"\n```\n\n\n```python\n# kick off a text detection job. This returns a job ID.\njob_id = textractWrapper.start_detection_job(bucket_name=bucket_name, document_file_name=file_name,\n sns_topic_arn=sns_topic_arn, sns_role_arn=sns_role_arn)\n```\n\nOnce the job completes, this will return a dictionary with the following keys:\n\n```dict_keys(['DocumentMetadata', 'JobStatus', 'NextToken', 'Blocks', 'AnalyzeDocumentModelVersion', 'ResponseMetadata'])```\n\nThis response corresponds to one chunk of information parsed by Textract. The number of chunks a document is parsed into depends on the length of the document. The two keys we are most interested in are `Blocks` and `NextToken`. `Blocks` contains all of the information that was extracted from this chunk, while `NextToken` tells us what chunk comes next, if any.\n\nTextract returns an information-rich representation of the extracted text, such as their position on the page and hierarchical relationships with other entities, all the way down to the individual word level. Since we are only interested in the raw text, we need a way to parse through all of the chunks and their `Blocks`. Lucky for us, Amazon provides some [helper functions](https://github.com/aws-samples/textract-paragraph-identification/tree/main) for this purpose, which we utilize below.\n\n\n```python\ndef get_text_results_from_textract(job_id):\n response = textract_client.get_document_text_detection(JobId=job_id)\n collection_of_textract_responses = []\n pages = [response]\n\n collection_of_textract_responses.append(response)\n\n while 'NextToken' in response:\n next_token = response['NextToken']\n response = textract_client.get_document_text_detection(JobId=job_id, NextToken=next_token)\n pages.append(response)\n collection_of_textract_responses.append(response)\n return collection_of_textract_responses\n\ndef get_the_text_with_required_info(collection_of_textract_responses):\n total_text = []\n total_text_with_info = []\n running_sequence_number = 0\n\n font_sizes_and_line_numbers = {}\n for page in collection_of_textract_responses:\n per_page_text = []\n blocks = page['Blocks']\n for block in blocks:\n if block['BlockType'] == 'LINE':\n block_text_dict = {}\n running_sequence_number += 1\n block_text_dict.update(text=block['Text'])\n block_text_dict.update(page=block['Page'])\n block_text_dict.update(left_indent=round(block['Geometry']['BoundingBox']['Left'], 2))\n font_height = round(block['Geometry']['BoundingBox']['Height'], 3)\n line_number = running_sequence_number\n block_text_dict.update(font_height=round(block['Geometry']['BoundingBox']['Height'], 3))\n block_text_dict.update(indent_from_top=round(block['Geometry']['BoundingBox']['Top'], 2))\n block_text_dict.update(text_width=round(block['Geometry']['BoundingBox']['Width'], 2))\n block_text_dict.update(line_number=running_sequence_number)\n\n if font_height in font_sizes_and_line_numbers:\n line_numbers = font_sizes_and_line_numbers[font_height]\n line_numbers.append(line_number)\n font_sizes_and_line_numbers[font_height] = line_numbers\n else:\n line_numbers = []\n line_numbers.append(line_number)\n font_sizes_and_line_numbers[font_height] = 
line_numbers\n\n total_text.append(block['Text'])\n per_page_text.append(block['Text'])\n total_text_with_info.append(block_text_dict)\n\n return total_text, total_text_with_info, font_sizes_and_line_numbers\n\ndef get_text_with_line_spacing_info(total_text_with_info):\n i = 1\n text_info_with_line_spacing_info = []\n while (i < len(total_text_with_info) - 1):\n previous_line_info = total_text_with_info[i - 1]\n current_line_info = total_text_with_info[i]\n next_line_info = total_text_with_info[i + 1]\n if current_line_info['page'] == next_line_info['page'] and previous_line_info['page'] == current_line_info[\n 'page']:\n line_spacing_after = round((next_line_info['indent_from_top'] - current_line_info['indent_from_top']), 2)\n spacing_with_prev = round((current_line_info['indent_from_top'] - previous_line_info['indent_from_top']), 2)\n current_line_info.update(line_space_before=spacing_with_prev)\n current_line_info.update(line_space_after=line_spacing_after)\n text_info_with_line_spacing_info.append(current_line_info)\n else:\n text_info_with_line_spacing_info.append(None)\n i += 1\n return text_info_with_line_spacing_info\n```\n\nWe feed in the Job ID from before into the function `get_text_results_from_textract` to fetch all of the chunks associated with this job. Then, we pass the resulting list into `get_the_text_with_required_info` and `get_text_with_line_spacing_info` to organize the text into lines.\n\nFinally, we can concatenate the lines into one string to pass into our downstream RAG pipeline.\n\n\n```python\nall_text = \"\\n\".join([line[\"text\"] if line else \"\" for line in text_info_with_line_spacing])\n\nwith open(f\"aws-parsed-{source_filename}.txt\", \"w\") as f:\n f.write(all_text)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"aws-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n### Solution 3: Unstructured.io [[Back to Solutions]](#top)\n\nUnstructured.io provides libraries with open-source components for pre-processing text documents such as PDFs, HTML and Word Documents.\n\nExternal documentation: https://github.com/Unstructured-IO/unstructured-api\n\n#### Parsing the document\n\nThe guide assumes an endpoint exists that hosts this service. The API is offered in two forms\n1. [a hosted version](https://unstructured.io/)\n2. 
[an OSS docker image](https://github.com/Unstructured-IO/unstructured-api?tab=readme-ov-file#dizzy-instructions-for-using-the-docker-image)\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\nimport os\nimport requests\n\nUNSTRUCTURED_URL = \"\" # enter service endpoint\n\nparsed_documents = []\n\ninput_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\nwith open(input_path, 'rb') as file_data:\n    response = requests.post(\n        url=UNSTRUCTURED_URL,\n        files={\"files\": (\"{}.{}\".format(source_filename, extension), file_data)},\n        data={\n            \"output_format\": (None, \"application/json\"),\n            \"strategy\": \"hi_res\",\n            \"pdf_infer_table_structure\": \"true\",\n            \"include_page_breaks\": \"true\"\n        },\n        headers={\"Accept\": \"application/json\"}\n    )\n\nparsed_response = response.json()\n\nparsed_document = \" \".join([parsed_entry[\"text\"] for parsed_entry in parsed_response])\nprint(\"Parsed {}\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, \"unstructured-io\")\nstore_document(file_path, parsed_document)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"unstructured-io-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n
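Because we requested table inference and page breaks above, the returned JSON elements can also be filtered by type before joining, rather than concatenated wholesale. A sketch, assuming each element in `parsed_response` carries the `type` and `text` fields of Unstructured's element schema:\n\n\n```python\n# Separate table elements and drop page breaks (assumes 'type'/'text' fields).\ntables = [el['text'] for el in parsed_response if el.get('type') == 'Table']\nbody_text = ' '.join(el['text'] for el in parsed_response if el.get('type') not in ('Table', 'PageBreak'))\nprint('Found {} table elements'.format(len(tables)))\n```\n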
\n### Solution 4: LlamaParse [[Back to Solutions]](#top)\n\nLlamaParse is an API created by LlamaIndex to parse and represent files for efficient retrieval and context augmentation with LlamaIndex frameworks.\n\nExternal documentation: https://github.com/run-llama/llama_parse\n\n#### Parsing the document\n\nThe following block uses the LlamaParse cloud offering. You can learn more and fetch a respective API key for the service [here](https://cloud.llamaindex.ai/parse).\n\nParsing documents with LlamaParse offers two output modes, both of which we will explore and compare below:\n- Text\n- Markdown\n\n**Note: You can skip to the next block if you want to use the pre-existing parsed version.**\n\n\n```python\nimport os\nfrom llama_parse import LlamaParse\n\nimport nest_asyncio # needed for notebook environments\nnest_asyncio.apply() # needed for notebook environments\n\nllama_index_api_key = \"{API_KEY}\"\ninput_path = \"{}/{}.{}\".format(data_dir, source_filename, extension)\n```\n\n\n```python\n# Text mode\ntext_parser = LlamaParse(\n    api_key=llama_index_api_key,\n    result_type=\"text\"\n)\n\ntext_response = text_parser.load_data(input_path)\ntext_parsed_document = \" \".join([parsed_entry.text for parsed_entry in text_response])\n\nprint(\"Parsed {} to text\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-text-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\nstore_document(file_path, text_parsed_document)\n```\n\n\n```python\n# Markdown mode\nmarkdown_parser = LlamaParse(\n    api_key=llama_index_api_key,\n    result_type=\"markdown\"\n)\n\nmarkdown_response = markdown_parser.load_data(input_path)\nmarkdown_parsed_document = \" \".join([parsed_entry.text for parsed_entry in markdown_response])\n\nprint(\"Parsed {} to markdown\".format(source_filename))\n```\n\n\n```python\n\"\"\"\nPost process parsed document and store it locally.\n\"\"\"\n\nfile_path = \"{}/{}-markdown-parsed-fda-approved-drug.txt\".format(data_dir, \"llamaparse\")\nstore_document(file_path, markdown_parsed_document)\n```\n\n#### Visualize the parsed document\n\n\n```python\n# Text parsing\n\nfilename = \"llamaparse-text-parsed-{}.txt\".format(source_filename)\n\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n    \nprint(parsed_document[:1000])\n```\n\n\n```python\n# Markdown parsing\n\nfilename = \"llamaparse-markdown-parsed-fda-approved-drug.txt\"\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n    \nprint(parsed_document[:1000])\n```\n\n\n\n### Solution 5: pdf2image + pytesseract [[Back to Solutions]](#top)\n\nThe final parsing method we examine does not rely on cloud services; instead, it relies on two libraries: `pdf2image` and `pytesseract`. `pytesseract` lets you perform OCR locally on images, but not PDF files. So, we first convert our PDF into a set of images via `pdf2image`.\n\n#### Parsing the document\n\n\n```python\nfrom matplotlib import pyplot as plt\nfrom pdf2image import convert_from_path\nimport pytesseract\n```\n\n\n```python\n# pdf2image extracts the source PDF as a list of PIL.Image objects\npages = convert_from_path(input_path)\n```\n\n\n```python\n# we look at the first page as a sanity check:\n\nplt.imshow(pages[0])\nplt.axis('off')\nplt.show()\n```\n\nNow, we can process the image of each page with `pytesseract` and concatenate the results to get our parsed document.\n\n\n```python\nlabel_ocr_pytesseract = \"\".join([pytesseract.image_to_string(page) for page in pages])\n```\n\n\n```python\nprint(label_ocr_pytesseract[:200])\n```\n\n    HIGHLIGHTS OF PRESCRIBING INFORMATION\n    \n    These highlights do not include all the information needed to use\n    IWILFIN™ safely and effectively. 
See full prescribing information for\n    IWILFIN.\n    \n    IWILFIN™ (eflor\n\n\n\n```python\nlabel_ocr_pytesseract = \"\".join([pytesseract.image_to_string(page) for page in pages])\n\nwith open(f\"pytesseract-parsed-{source_filename}.txt\", \"w\") as f:\n    f.write(label_ocr_pytesseract)\n```\n\n#### Visualize the parsed document\n\n\n```python\nfilename = \"pytesseract-parsed-{}.txt\".format(source_filename)\nwith open(\"{}/{}\".format(data_dir, filename), \"r\") as doc:\n    parsed_document = doc.read()\n\nprint(parsed_document[:1000])\n```\n\n\n## Document Questions\n\nWe can now ask a set of simple and complex questions and see how each parsing solution performs with Command-R. The questions are:\n- **What are the most common adverse reactions of Iwilfin?**\n  - Task: Simple information extraction\n- **What is the recommended dosage of IWILFIN on body surface area between 0.5 m2 and 0.75 m2?**\n  - Task: Tabular data extraction\n- **I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.**\n  - Task: Overall document summary\n\n\n```python\nimport cohere\nco = cohere.Client(api_key=\"{API_KEY}\")\n```\n\n\n```python\n\"\"\"\nDocument Questions\n\"\"\"\nprompt = \"What are the most common adverse reactions of Iwilfin?\"\n# prompt = \"What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\"\n# prompt = \"I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\"\n\n\"\"\"\nChoose one of the above solutions\n\"\"\"\nsource = \"gcp\"\n# source = \"aws\"\n# source = \"unstructured-io\"\n# source = \"llamaparse-text\"\n# source = \"llamaparse-markdown\"\n# source = \"pytesseract\"\n```\n\n## Data Ingestion\n\n\nIn order to set up our RAG implementation, we need to separate the parsed text into chunks and load the chunks to an index. The index will allow us to retrieve relevant passages from the document for different queries. Here, we use a simple implementation of indexing using the `hnswlib` library. 
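As a back-of-envelope check on the chunking step below: with a character window of size 512 and overlap 200, the chunk count is roughly the document length divided by the stride (a sketch; `doc_len` is a hypothetical length, and real counts vary because the splitter prefers natural separator boundaries).\n\n\n```python\nimport math\n\n# Rough chunk-count estimate for a sliding window of size 512 with 200 overlap.\ndoc_len = 36_000 # hypothetical parsed-document length in characters\nchunk_size, chunk_overlap = 512, 200\nstride = chunk_size - chunk_overlap\nprint(math.ceil(max(doc_len - chunk_overlap, 1) / stride)) # 115 for this hypothetical length\n```\n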
Note that there are many different indexing solutions that are appropriate for specific production use cases.\n\n\n```python\n\"\"\"\nRead parsed document content and chunk data\n\"\"\"\n\nimport os\nfrom langchain_text_splitters import RecursiveCharacterTextSplitter\n\ndocuments = []\n\nwith open(\"{}/{}-parsed-fda-approved-drug.txt\".format(data_dir, source), \"r\") as doc:\n    doc_content = doc.read()\n\n# Further reading on chunking strategies:\n# https://medium.com/@ayhamboucher/llm-based-context-splitter-for-large-documents-445d3f02b01b\n\n\n# Chunk doc content\ntext_splitter = RecursiveCharacterTextSplitter(\n    chunk_size=512,\n    chunk_overlap=200,\n    length_function=len,\n    is_separator_regex=False\n)\n\n# Split the text into chunks with some overlap\nchunks_ = text_splitter.create_documents([doc_content])\ndocuments = [c.page_content for c in chunks_]\n\nprint(\"Source document has been broken down to {} chunks\".format(len(documents)))\n```\n\n\n```python\n\"\"\"\nEmbed document chunks\n\"\"\"\ndocument_embeddings = co.embed(texts=documents, model=\"embed-english-v3.0\", input_type=\"search_document\").embeddings\n```\n\n\n```python\n\"\"\"\nCreate document index and add embedded chunks\n\"\"\"\n\nimport hnswlib\n\nindex = hnswlib.Index(space='ip', dim=1024) # space: inner product\nindex.init_index(max_elements=len(document_embeddings), ef_construction=512, M=64)\nindex.add_items(document_embeddings, list(range(len(document_embeddings))))\nprint(\"Count:\", index.element_count)\n```\n\n    Count: 115\n\n\n## Retrieval\n\nIn this step, we use k-nearest neighbors to fetch the most relevant documents for our query. Once the nearest neighbors are retrieved, we use Cohere's reranker to reorder the documents in the most relevant order with regard to our input search query.\n\n\n```python\n\"\"\"\nEmbed search query\nFetch k nearest neighbors\n\"\"\"\n\nquery_emb = co.embed(texts=[prompt], model='embed-english-v3.0', input_type=\"search_query\").embeddings\ndefault_knn = 10\nknn = default_knn if default_knn <= index.element_count else index.element_count\nresult = index.knn_query(query_emb, k=knn)\nneighbors = [(result[0][0][i], result[1][0][i]) for i in range(len(result[0][0]))]\nrelevant_docs = [documents[x[0]] for x in sorted(neighbors, key=lambda x: x[1])]\n```\n\n\n```python\n\"\"\"\nRerank retrieved documents\n\"\"\"\n\nrerank_results = co.rerank(query=prompt, documents=relevant_docs, top_n=3, model='rerank-english-v2.0').results\nreranked_relevant_docs = format_docs_for_chat([x.document[\"text\"] for x in rerank_results])\n```\n\n## Final Step: Call Command-R + RAG!\n\n\n```python\n\"\"\"\nCall the /chat endpoint with command-r\n\"\"\"\n\nresponse = co.chat(\n    message=prompt,\n    model=\"command-r\",\n    documents=reranked_relevant_docs\n)\n\ncited_response, citations_reference = insert_citations_in_order(response.text, response.citations, reranked_relevant_docs)\nprint(cited_response)\nprint(\"\\n\")\nprint(\"References:\")\nprint(citations_reference)\n```\n\n## Head-to-head Comparisons\n\nRun the code cells below to make head-to-head comparisons of the different parsing techniques across different questions.\n\n\n```python\nimport pandas as pd\nresults = pd.read_csv(\"{}/results-table.csv\".format(data_dir))\n```\n\n\n```python\nquestion = input(\"\"\"\nQuestion 1: What are the most common adverse reactions of Iwilfin?\nQuestion 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\nQuestion 3: I need a succinct summary of the compound name, 
indication, route of administration, and mechanism of action of Iwilfin.\n\nPick which question you want to see (1,2,3): \"\"\")\nreferences = input(\"Do you want to see the references as well? References are long and noisy (y/n): \")\nprint(\"\\n\\n\")\n\nindex = {\"1\": 0, \"2\": 3, \"3\": 6}[question]\n\nfor src in [\"gcp\", \"aws\", \"unstructured-io\", \"llamaparse-text\", \"llamaparse-markdown\", \"pytesseract\"]:\n print(\"| {} |\".format(src))\n print(\"\\n\")\n print(results[src][index])\n if references == \"y\":\n print(\"\\n\")\n print(\"References:\")\n print(results[src][index+1])\n print(\"\\n\")\n```\n\n \n Question 1: What are the most common adverse reactions of Iwilfin?\n Question 2: What is the recommended dosage of Iwilfin on body surface area between 0.5 m2 and 0.75 m2?\n Question 3: I need a succinct summary of the compound name, indication, route of administration, and mechanism of action of Iwilfin.\n \n Pick which question you want to see (1,2,3): 3\n Do you want to see the references as well? References are long and noisy (y/n): n\n \n \n \n | gcp |\n \n \n Compound Name: eflornithine hydrochloride ([0], [1], [2]) (IWILFIN ([1])™)\n \n Indication: used to reduce the risk of relapse in adult and paediatric patients with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded at least partially to prior multiagent, multimodality therapy. ([1], [3], [4])\n \n Route of Administration: IWILFIN™ tablets ([1], [3], [4]) are taken orally twice daily ([3], [4]), with doses ranging from 192 to 768 mg based on body surface area. ([3], [4])\n \n Mechanism of Action: IWILFIN™ is an ornithine decarboxylase inhibitor. ([0], [2])\n \n \n \n | aws |\n \n \n Compound Name: eflornithine ([0], [1], [2], [3]) (IWILFIN ([0])™)\n \n Indication: used to reduce the risk of relapse ([0], [3]) in adults ([0], [3]) and paediatric patients ([0], [3]) with high-risk neuroblastoma (HRNB) ([0], [3]) who have responded to prior therapies. ([0], [3], [4])\n \n Route of Administration: Oral ([2], [4])\n \n Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. ([1])\n \n \n | unstructured-io |\n \n \n Compound Name: Iwilfin ([1], [2], [3], [4]) (eflornithine) ([0], [2], [3], [4])\n \n Indication: Iwilfin is indicated to reduce the risk of relapse ([1], [3]) in adult and paediatric patients ([1], [3]) with high-risk neuroblastoma (HRNB) ([1], [3]), who have responded to prior anti-GD2 ([1]) immunotherapy ([1], [4]) and multi-modality therapy. ([1])\n \n Route of Administration: Oral ([0], [3])\n \n Mechanism of Action: Iwilfin is an ornithine decarboxylase inhibitor. ([1], [2], [3], [4])\n \n \n | llamaparse-text |\n \n \n Compound Name: IWILFIN ([2], [3]) (eflornithine) ([3])\n \n Indication: IWILFIN is used to reduce the risk of relapse ([1], [2], [3]) in adult and paediatric patients ([1], [2], [3]) with high-risk neuroblastoma (HRNB) ([1], [2], [3]), who have responded at least partially to certain prior therapies. ([2], [3])\n \n Route of Administration: IWILFIN is administered as a tablet. ([2])\n \n Mechanism of Action: IWILFIN is an ornithine decarboxylase inhibitor. 
([0], [1], [4])\n \n \n | llamaparse-markdown |\n \n \n Compound Name: IWILFIN ([1], [2]) (eflornithine) ([1])\n \n Indication: IWILFIN is indicated to reduce the risk of relapse ([1], [2]) in adult and paediatric patients ([1], [2]) with high-risk neuroblastoma (HRNB) ([1], [2]), who have responded at least partially ([1], [2], [3]) to prior anti-GD2 immunotherapy ([1], [2]) and multiagent, multimodality therapy. ([1], [2], [3])\n \n Route of Administration: Oral ([0], [1], [3], [4])\n \n Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([1])\n \n \n | pytesseract |\n \n \n Compound Name: IWILFIN™ ([0], [2]) (eflornithine) ([0], [2])\n \n Indication: IWILFIN is indicated to reduce the risk of relapse ([0], [2]) in adult and paediatric patients ([0], [2]) with high-risk neuroblastoma (HRNB) ([0], [2]), who have responded positively to prior anti-GD2 immunotherapy and multiagent, multimodality therapy. ([0], [2], [4])\n \n Route of Administration: IWILFIN is administered orally ([0], [1], [3], [4]), in the form of a tablet. ([1])\n \n Mechanism of Action: IWILFIN acts as an ornithine decarboxylase inhibitor. ([0])", "html": "", "htmlmode": false, "fullscreen": false, diff --git a/scripts/cookbooks-json/fueling-generative-content.json b/scripts/cookbooks-json/fueling-generative-content.json index 792ef656..fb6d4b8b 100644 --- a/scripts/cookbooks-json/fueling-generative-content.json +++ b/scripts/cookbooks-json/fueling-generative-content.json @@ -13,7 +13,7 @@ }, "title": "Fueling Generative Content with Keyword Research", "slug": "fueling-generative-content", - "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Fueling Generative Content with Keyword Research

\\n
\\n\\n\"\n}\n[/block]\n\n\nGenerative models have proven extremely useful in content idea generation. But they don’t take into account user search demand and trends. In this notebook, let’s see how we can solve that by adding keyword research into the equation.\n\nRead the accompanying [blog post here](https://txt.cohere.ai/generative-content-keyword-research/).\n\n```python\n! pip install cohere -q\n```\n\n```python\nimport cohere\nimport numpy as np\nimport pandas as pd\nfrom sklearn.cluster import KMeans\n\nimport cohere\nco = cohere.Client(\"COHERE_API_KEY\") # Get your API key: https://dashboard.cohere.com/api-keys\n```\n\n\n\n```python\n#@title Enable text wrapping in Google Colab\n\nfrom IPython.display import HTML, display\n\ndef set_css():\n display(HTML('''\n \n '''))\nget_ipython().events.register('pre_run_cell', set_css)\n```\n\nFirst, we need to get a supply of high-traffic keywords for a given topic. We can get this via keyword research tools, of which are many available. We’ll use Google Keyword Planner, which is free to use.\n\n```python\n\nimport wget\nwget.download(\"https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/remote_teams.csv\", \"remote_teams.csv\")\n```\n\n\n\n```\n'remote_teams.csv'\n```\n\n```python\ndf = pd.read_csv('remote_teams.csv')\ndf.columns = [\"keyword\",\"volume\"]\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolume
0managing remote teams1000
1remote teams390
2collaboration tools for remote teams320
3online games for remote teams320
4how to manage remote teams260
\n
\n\nWe now have a list of keywords, but this list is still raw. For example, “managing remote teams” is the top-ranking keyword in this list. But at the same time, there are many similar keywords further down in the list, such as “how to effectively manage remote teams.”\n\nWe can do that by clustering them into topics. For this, we’ll leverage Cohere’s Embed endpoint and scikit-learn.\n\n### Embed the Keywords with Cohere Embed\n\nThe Cohere Embed endpoint turns a text input into a text embedding.\n\n```python\ndef embed_text(texts):\n output = co.embed(\n texts=texts,\n model='embed-english-v3.0',\n input_type=\"search_document\",\n )\n return output.embeddings\n\nembeds = np.array(embed_text(df['keyword'].tolist()))\n```\n\n\n\n### Cluster the Keywords into Topics with scikit-learn\n\nWe then use these embeddings to cluster the keywords. A common term used for this exercise is “topic modeling.” Here, we can leverage scikit-learn’s KMeans module, a machine learning algorithm for clustering.\n\n```python\nNUM_TOPICS = 4\nkmeans = KMeans(n_clusters=NUM_TOPICS, random_state=21, n_init=\"auto\").fit(embeds)\ndf['topic'] = list(kmeans.labels_)\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolumetopic
0managing remote teams10000
1remote teams3901
2collaboration tools for remote teams3201
3online games for remote teams3203
4how to manage remote teams2600
\n
\n\n### Generate Topic Names with Cohere Chat\n\nWe use the Chat to generate a topic name for that cluster.\n\n```python\ntopic_keywords_dict = {topic: list(set(group['keyword'])) for topic, group in df.groupby('topic')}\n```\n\n\n\n```python\ndef generate_topic_name(keywords):\n # Construct the prompt\n prompt = f\"\"\"Generate a concise topic name that best represents these keywords.\\\nProvide just the topic name and not any additional details.\n\nKeywords: {', '.join(keywords)}\"\"\"\n \n # Call the Cohere API\n response = co.chat(\n model='command-r', # Choose the model size\n message=prompt,\n preamble=\"\")\n \n # Return the generated text\n return response.text\n```\n\n\n\n```python\ntopic_name_mapping = {topic: generate_topic_name(keywords) for topic, keywords in topic_keywords_dict.items()}\n\ndf['topic_name'] = df['topic'].map(topic_name_mapping)\n\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolumetopictopic_name
0managing remote teams10000Remote Team Management
1remote teams3901Remote Team Tools and Tips
2collaboration tools for remote teams3201Remote Team Tools and Tips
3online games for remote teams3203Remote Team Fun
4how to manage remote teams2600Remote Team Management
\n
\n\n```python\nfor topic, name in topic_name_mapping.items():\n print(f\"Topic {topic}: {name}\")\n```\n\n\n\n```\nTopic 0: Remote Team Management\nTopic 1: Remote Team Tools and Tips\nTopic 2: Remote Team Resources\nTopic 3: Remote Team Fun\n```\n\nNow that we have the keywords nicely grouped into topics, we can proceed to generate the content ideas.\n\n### Take the Top Keywords from Each Topic\n\nHere we can implement a filter to take just the top N keywords from each topic, sorted by the search volume. In our case, we use 10.\n\n```python\nTOP_N = 10\n\ntop_keywords = (df.groupby('topic')\n .apply(lambda x: x.nlargest(TOP_N, 'volume'))\n .reset_index(drop=True))\n\n\ncontent_by_topic = {}\nfor topic, group in top_keywords.groupby('topic'):\n keywords = ', '.join(list(group['keyword']))\n topic2name = topic2name = dict(df.groupby('topic')['topic_name'].first())\n topic_name = topic2name[topic]\n content_by_topic[topic] = {'topic_name': topic_name, 'keywords': keywords}\n```\n\n\n\n```python\ncontent_by_topic\n```\n\n\n\n```\n{0: {'topic_name': 'Remote Team Management',\n 'keywords': 'managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training'},\n 1: {'topic_name': 'Remote Team Tools and Tips',\n 'keywords': 'remote teams, collaboration tools for remote teams, team building for remote teams, scrum remote teams, tools for remote teams, zapier remote teams, working agreements for remote teams, working with remote teams, free collaboration tools for remote teams, free retrospective tools for remote teams'},\n 2: {'topic_name': 'Remote Team Resources',\n 'keywords': 'best collaboration tools for remote teams, slack best practices for remote teams, best communication tools for remote teams, best tools for remote teams, always on video for remote teams, best apps for remote teams, best free collaboration tools for remote teams, best games for remote teams, best gifts for remote teams, best ice breaker questions for remote teams'},\n 3: {'topic_name': 'Remote Team Fun',\n 'keywords': 'online games for remote teams, team building activities for remote teams, games for remote teams, retrospective ideas for remote teams, team building ideas for remote teams, fun retrospective ideas for remote teams, retro ideas for remote teams, team building exercises for remote teams, trust building exercises for remote teams, activities for remote teams'}}\n```\n\n### Create a Prompt with These Keywords\n\nNext, we use the Chat endpoint to produce the content ideas. The prompt we’ll use is as follows\n\n```python\ndef generate_blog_ideas(keywords):\n prompt = f\"\"\"{keywords}\\n\\nThe above is a list of high-traffic keywords obtained from a keyword research tool. \nSuggest three blog post ideas that are highly relevant to these keywords. \nFor each idea, write a one paragraph abstract about the topic. \nUse this format:\nBlog title: \nAbstract: \"\"\"\n \n response = co.chat(\n model='command-r',\n message = prompt)\n return response.text\n\n```\n\n\n\n### Generate Content Ideas\n\nNext, we generate the blog post ideas. 
It takes in a string of keywords, calls the Chat endpoint, and returns the generated text.\n\n```python\nfor key,value in content_by_topic.items():\n value['ideas'] = generate_blog_ideas(value['keywords'])\n\n\nfor key,value in content_by_topic.items():\n print(f\"Topic Name: {value['topic_name']}\\n\")\n print(f\"Top Keywords: {value['keywords']}\\n\")\n print(f\"Blog Post Ideas: {value['ideas']}\\n\")\n print(\"-\"*50)\n```\n\n\n\n```\nTopic Name: Remote Team Management\n\nTop Keywords: managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training\n\nBlog Post Ideas: Here are three blog post ideas:\n\n1. Blog title: \"Leading Remote Teams: Strategies for Effective Management\"\n Abstract: Effective management of remote teams is crucial for success, but it comes with unique challenges. This blog will explore practical strategies for leading dispersed employees, focusing on building a cohesive and productive virtual workforce. It will cover topics such as establishing clear communication protocols, fostering a collaborative environment, and the importance of trusting and empowering your remote employees for enhanced performance.\n\n2. Blog title: \"Remote Teams' Best Practices: Creating a Vibrant and Engaging Culture\"\n Abstract: Building a rich culture in a remote team setting is essential for employee engagement and retention. The blog will delve into creative ways to foster a sense of community and connection among team members who may be scattered across the globe. It will offer practical tips on creating virtual rituals, fostering open communication, and harnessing the power of technology for cultural development, ensuring remote employees feel valued and engaged.\n\n3. Blog title: \"Managing Remote Teams: A Comprehensive Guide to Training and Development\"\n Abstract: Training and developing remote teams present specific challenges and opportunities. This comprehensive guide will arm managers with techniques to enhance their remote team's skills and knowledge. It will explore the latest tools and methodologies for remote training, including virtual workshops, e-learning platforms, and performance coaching. Additionally, the blog will discuss the significance of ongoing development and how to create an environment that encourages self-improvement and learning.\n\nEach of these topics explores a specific aspect of managing remote teams, providing valuable insights and practical guidance for leaders and managers in the evolving remote work landscape.\n\n--------------------------------------------------\nTopic Name: Remote Team Tools and Tips\n\nTop Keywords: remote teams, collaboration tools for remote teams, team building for remote teams, scrum remote teams, tools for remote teams, zapier remote teams, working agreements for remote teams, working with remote teams, free collaboration tools for remote teams, free retrospective tools for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Ultimate Guide to Building Effective Remote Teams\"\n Abstract: Building a cohesive and productive remote team can be challenging. This blog will serve as a comprehensive guide, offering practical tips and insights on how to create a united and successful virtual workforce. 
It will cover essential topics such as building a strong team culture, utilizing collaboration tools, and fostering effective communication strategies, ensuring remote teams can thrive and achieve their full potential.\n\n2. Blog title: \"The Best Collaboration Tools for Remote Teams: A Comprehensive Review\"\n Abstract: With the rapid rise of remote work, collaboration tools have become essential for teams' productivity and efficiency. This blog aims to review and compare the most popular collaboration tools, providing an in-depth analysis of their features, ease of use, and benefits. It will offer insights into choosing the right tools for remote collaboration, helping teams streamline their workflows and enhance their overall performance.\n\n3. Blog title: \"Remote Retrospective: A Guide to Reflect and Grow as a Remote Team\"\n Abstract: Conducting effective retrospectives is crucial for remote teams to reflect on their experiences, learn from the past, and chart a course for the future. This blog will focus on remote retrospectives, exploring different formats, techniques, and free tools that teams can use to foster continuous improvement. It will also provide tips on creating a safe and inclusive environment, encouraging honest feedback and productive discussions.\n\n--------------------------------------------------\nTopic Name: Remote Team Resources\n\nTop Keywords: best collaboration tools for remote teams, slack best practices for remote teams, best communication tools for remote teams, best tools for remote teams, always on video for remote teams, best apps for remote teams, best free collaboration tools for remote teams, best games for remote teams, best gifts for remote teams, best ice breaker questions for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Ultimate Guide to Remote Team Collaboration Tools\"\n Abstract: With the rise of remote work, choosing the right collaboration tools can be crucial to a team's success and productivity. This blog aims to be an comprehensive guide, outlining the various types of tools available, from communication platforms like Slack to project management software and online collaboration tools. It will offer best practices and guidelines for selecting and utilizing these tools, ensuring remote teams can work seamlessly together and maximize their output.\n\n2. Blog title: \"Remote Team Management: Tips for Leading a Successful Virtual Workforce\"\n Abstract: Managing a remote team comes with its own set of challenges. This blog will provide an in-depth exploration of best practices for leading and motivating virtual teams. Covering topics such as effective communication strategies, performance evaluation, and maintaining a cohesive team culture, it will offer practical tips for managers and leaders to ensure their remote teams are engaged, productive, and well-managed.\n\n3. Blog title: \"The Fun Side of Remote Work: Games, Icebreakers, and Team Building Activities\"\n Abstract: Remote work can be isolating, and this blog aims to provide some fun and creative solutions. It will offer a comprehensive guide to the best online games, icebreaker questions, and virtual team building activities that remote teams can use to connect and bond. 
From virtual escape rooms to interactive games and thought-provoking icebreakers, these ideas will help enhance team spirit, foster collaboration, and create a enjoyable remote work experience.\n\n--------------------------------------------------\nTopic Name: Remote Team Fun\n\nTop Keywords: online games for remote teams, team building activities for remote teams, games for remote teams, retrospective ideas for remote teams, team building ideas for remote teams, fun retrospective ideas for remote teams, retro ideas for remote teams, team building exercises for remote teams, trust building exercises for remote teams, activities for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Great Remote Retro: Fun Games and Activities for Your Team\"\n Abstract: Remote work can make team building challenging. This blog post will be a fun guide to hosting interactive retro games and activities that bring your remote team together. From online escape rooms to virtual scavenger hunts, we'll explore the best ways to engage and unite your team, fostering collaboration and camaraderie. Virtual icebreakers and retrospective ideas will also be included to make your remote meetings more interactive and enjoyable.\n\n2. Blog title: \"Trust Falls: Building Trust Among Remote Teams\"\n Abstract: Trust is the foundation of every successful team, but how do you build it when everyone is scattered across different locations? This blog will focus on trust-building exercises and activities designed specifically for remote teams. From virtual trust falls to transparent communication practices, we'll discover innovative ways to strengthen team bonds and foster a culture of trust. You'll learn how to create an environment where your remote team can thrive and collaborate effectively.\n\n3. Blog title: \"Game Night for Remote Teams: A Guide to Online Games and Activities\"\n Abstract: Miss the old office game nights? This blog will bring the fun back with a guide to hosting online game nights and activities that are perfect for remote teams. From trivia games to virtual board games and even remote-friendly outdoor adventures, we'll keep your team engaged and entertained. With tips on setting up online tournaments and ideas for encouraging participation, your virtual game nights will be the highlight of your team's week. Keep your remote team spirit high!\n\n--------------------------------------------------\n```", + "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Fueling Generative Content with Keyword Research

\\n
\\n\\n\"\n}\n[/block]\n\n\nGenerative models have proven extremely useful in content idea generation. But they don’t take into account user search demand and trends. In this notebook, let’s see how we can solve that by adding keyword research into the equation.\n\nRead the accompanying [blog post here](https://cohere.com/blog/generative-content-keyword-research/).\n\n```python\n! pip install cohere -q\n```\n\n```python\nimport cohere\nimport numpy as np\nimport pandas as pd\nfrom sklearn.cluster import KMeans\n\nimport cohere\nco = cohere.Client(\"COHERE_API_KEY\") # Get your API key: https://dashboard.cohere.com/api-keys\n```\n\n\n\n```python\n#@title Enable text wrapping in Google Colab\n\nfrom IPython.display import HTML, display\n\ndef set_css():\n display(HTML('''\n \n '''))\nget_ipython().events.register('pre_run_cell', set_css)\n```\n\nFirst, we need to get a supply of high-traffic keywords for a given topic. We can get this via keyword research tools, of which are many available. We’ll use Google Keyword Planner, which is free to use.\n\n```python\n\nimport wget\nwget.download(\"https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/remote_teams.csv\", \"remote_teams.csv\")\n```\n\n\n\n```\n'remote_teams.csv'\n```\n\n```python\ndf = pd.read_csv('remote_teams.csv')\ndf.columns = [\"keyword\",\"volume\"]\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolume
0managing remote teams1000
1remote teams390
2collaboration tools for remote teams320
3online games for remote teams320
4how to manage remote teams260
\n
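\n\nAs a quick optional check (not part of the original notebook), we can confirm how many keywords were loaded and peek at the highest-volume terms:\n\n```python\n# Optional sanity check: dataset size and the top keywords by search volume\nprint(df.shape)\nprint(df.sort_values('volume', ascending=False).head())\n```\n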
\n\nWe now have a list of keywords, but this list is still raw. For example, “managing remote teams” is the top-ranking keyword in this list. But at the same time, there are many similar keywords further down in the list, such as “how to effectively manage remote teams.”\n\nWe can do that by clustering them into topics. For this, we’ll leverage Cohere’s Embed endpoint and scikit-learn.\n\n### Embed the Keywords with Cohere Embed\n\nThe Cohere Embed endpoint turns a text input into a text embedding.\n\n```python\ndef embed_text(texts):\n output = co.embed(\n texts=texts,\n model='embed-english-v3.0',\n input_type=\"search_document\",\n )\n return output.embeddings\n\nembeds = np.array(embed_text(df['keyword'].tolist()))\n```\n\n\n\n### Cluster the Keywords into Topics with scikit-learn\n\nWe then use these embeddings to cluster the keywords. A common term used for this exercise is “topic modeling.” Here, we can leverage scikit-learn’s KMeans module, a machine learning algorithm for clustering.\n\n```python\nNUM_TOPICS = 4\nkmeans = KMeans(n_clusters=NUM_TOPICS, random_state=21, n_init=\"auto\").fit(embeds)\ndf['topic'] = list(kmeans.labels_)\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolumetopic
0managing remote teams10000
1remote teams3901
2collaboration tools for remote teams3201
3online games for remote teams3203
4how to manage remote teams2600
\n
\n\n### Generate Topic Names with Cohere Chat\n\nWe use the Chat to generate a topic name for that cluster.\n\n```python\ntopic_keywords_dict = {topic: list(set(group['keyword'])) for topic, group in df.groupby('topic')}\n```\n\n\n\n```python\ndef generate_topic_name(keywords):\n # Construct the prompt\n prompt = f\"\"\"Generate a concise topic name that best represents these keywords.\\\nProvide just the topic name and not any additional details.\n\nKeywords: {', '.join(keywords)}\"\"\"\n \n # Call the Cohere API\n response = co.chat(\n model='command-r', # Choose the model size\n message=prompt,\n preamble=\"\")\n \n # Return the generated text\n return response.text\n```\n\n\n\n```python\ntopic_name_mapping = {topic: generate_topic_name(keywords) for topic, keywords in topic_keywords_dict.items()}\n\ndf['topic_name'] = df['topic'].map(topic_name_mapping)\n\ndf.head()\n```\n\n\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keywordvolumetopictopic_name
0managing remote teams10000Remote Team Management
1remote teams3901Remote Team Tools and Tips
2collaboration tools for remote teams3201Remote Team Tools and Tips
3online games for remote teams3203Remote Team Fun
4how to manage remote teams2600Remote Team Management
\n
\n\n```python\nfor topic, name in topic_name_mapping.items():\n    print(f\"Topic {topic}: {name}\")\n```\n\n\n\n```\nTopic 0: Remote Team Management\nTopic 1: Remote Team Tools and Tips\nTopic 2: Remote Team Resources\nTopic 3: Remote Team Fun\n```\n\nNow that we have the keywords nicely grouped into topics, we can proceed to generate the content ideas.\n\n### Take the Top Keywords from Each Topic\n\nHere we can implement a filter to take just the top N keywords from each topic, sorted by the search volume. In our case, we use 10.\n\n```python\nTOP_N = 10\n\ntop_keywords = (df.groupby('topic')\n                .apply(lambda x: x.nlargest(TOP_N, 'volume'))\n                .reset_index(drop=True))\n\n# Map each topic ID to its generated topic name\ntopic2name = dict(df.groupby('topic')['topic_name'].first())\n\ncontent_by_topic = {}\nfor topic, group in top_keywords.groupby('topic'):\n    keywords = ', '.join(list(group['keyword']))\n    topic_name = topic2name[topic]\n    content_by_topic[topic] = {'topic_name': topic_name, 'keywords': keywords}\n```\n\n\n\n```python\ncontent_by_topic\n```\n\n\n\n```\n{0: {'topic_name': 'Remote Team Management',\n  'keywords': 'managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training'},\n 1: {'topic_name': 'Remote Team Tools and Tips',\n  'keywords': 'remote teams, collaboration tools for remote teams, team building for remote teams, scrum remote teams, tools for remote teams, zapier remote teams, working agreements for remote teams, working with remote teams, free collaboration tools for remote teams, free retrospective tools for remote teams'},\n 2: {'topic_name': 'Remote Team Resources',\n  'keywords': 'best collaboration tools for remote teams, slack best practices for remote teams, best communication tools for remote teams, best tools for remote teams, always on video for remote teams, best apps for remote teams, best free collaboration tools for remote teams, best games for remote teams, best gifts for remote teams, best ice breaker questions for remote teams'},\n 3: {'topic_name': 'Remote Team Fun',\n  'keywords': 'online games for remote teams, team building activities for remote teams, games for remote teams, retrospective ideas for remote teams, team building ideas for remote teams, fun retrospective ideas for remote teams, retro ideas for remote teams, team building exercises for remote teams, trust building exercises for remote teams, activities for remote teams'}}\n```\n\n### Create a Prompt with These Keywords\n\nNext, we use the Chat endpoint to produce the content ideas. The prompt we’ll use is as follows:\n\n```python\ndef generate_blog_ideas(keywords):\n    prompt = f\"\"\"{keywords}\\n\\nThe above is a list of high-traffic keywords obtained from a keyword research tool. \nSuggest three blog post ideas that are highly relevant to these keywords. \nFor each idea, write a one paragraph abstract about the topic. \nUse this format:\nBlog title: \nAbstract: \"\"\"\n\n    response = co.chat(\n        model='command-r',\n        message=prompt)\n    return response.text\n```\n\n\n\n### Generate Content Ideas\n\nNext, we generate the blog post ideas. 
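Before running the full loop, it can help to spot-check a single topic. This is an optional step that is not part of the original notebook flow, and the generated ideas will vary from run to run:\n\n```python\n# Optional spot-check (not in the original notebook): preview the ideas\n# generated for the first topic before running the full loop below\nprint(generate_blog_ideas(content_by_topic[0]['keywords']))\n```\n\nThe loop below applies the same `generate_blog_ideas` function to every topic. 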
It takes in a string of keywords, calls the Chat endpoint, and returns the generated text.\n\n```python\nfor key,value in content_by_topic.items():\n value['ideas'] = generate_blog_ideas(value['keywords'])\n\n\nfor key,value in content_by_topic.items():\n print(f\"Topic Name: {value['topic_name']}\\n\")\n print(f\"Top Keywords: {value['keywords']}\\n\")\n print(f\"Blog Post Ideas: {value['ideas']}\\n\")\n print(\"-\"*50)\n```\n\n\n\n```\nTopic Name: Remote Team Management\n\nTop Keywords: managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training\n\nBlog Post Ideas: Here are three blog post ideas:\n\n1. Blog title: \"Leading Remote Teams: Strategies for Effective Management\"\n Abstract: Effective management of remote teams is crucial for success, but it comes with unique challenges. This blog will explore practical strategies for leading dispersed employees, focusing on building a cohesive and productive virtual workforce. It will cover topics such as establishing clear communication protocols, fostering a collaborative environment, and the importance of trusting and empowering your remote employees for enhanced performance.\n\n2. Blog title: \"Remote Teams' Best Practices: Creating a Vibrant and Engaging Culture\"\n Abstract: Building a rich culture in a remote team setting is essential for employee engagement and retention. The blog will delve into creative ways to foster a sense of community and connection among team members who may be scattered across the globe. It will offer practical tips on creating virtual rituals, fostering open communication, and harnessing the power of technology for cultural development, ensuring remote employees feel valued and engaged.\n\n3. Blog title: \"Managing Remote Teams: A Comprehensive Guide to Training and Development\"\n Abstract: Training and developing remote teams present specific challenges and opportunities. This comprehensive guide will arm managers with techniques to enhance their remote team's skills and knowledge. It will explore the latest tools and methodologies for remote training, including virtual workshops, e-learning platforms, and performance coaching. Additionally, the blog will discuss the significance of ongoing development and how to create an environment that encourages self-improvement and learning.\n\nEach of these topics explores a specific aspect of managing remote teams, providing valuable insights and practical guidance for leaders and managers in the evolving remote work landscape.\n\n--------------------------------------------------\nTopic Name: Remote Team Tools and Tips\n\nTop Keywords: remote teams, collaboration tools for remote teams, team building for remote teams, scrum remote teams, tools for remote teams, zapier remote teams, working agreements for remote teams, working with remote teams, free collaboration tools for remote teams, free retrospective tools for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Ultimate Guide to Building Effective Remote Teams\"\n Abstract: Building a cohesive and productive remote team can be challenging. This blog will serve as a comprehensive guide, offering practical tips and insights on how to create a united and successful virtual workforce. 
It will cover essential topics such as building a strong team culture, utilizing collaboration tools, and fostering effective communication strategies, ensuring remote teams can thrive and achieve their full potential.\n\n2. Blog title: \"The Best Collaboration Tools for Remote Teams: A Comprehensive Review\"\n Abstract: With the rapid rise of remote work, collaboration tools have become essential for teams' productivity and efficiency. This blog aims to review and compare the most popular collaboration tools, providing an in-depth analysis of their features, ease of use, and benefits. It will offer insights into choosing the right tools for remote collaboration, helping teams streamline their workflows and enhance their overall performance.\n\n3. Blog title: \"Remote Retrospective: A Guide to Reflect and Grow as a Remote Team\"\n Abstract: Conducting effective retrospectives is crucial for remote teams to reflect on their experiences, learn from the past, and chart a course for the future. This blog will focus on remote retrospectives, exploring different formats, techniques, and free tools that teams can use to foster continuous improvement. It will also provide tips on creating a safe and inclusive environment, encouraging honest feedback and productive discussions.\n\n--------------------------------------------------\nTopic Name: Remote Team Resources\n\nTop Keywords: best collaboration tools for remote teams, slack best practices for remote teams, best communication tools for remote teams, best tools for remote teams, always on video for remote teams, best apps for remote teams, best free collaboration tools for remote teams, best games for remote teams, best gifts for remote teams, best ice breaker questions for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Ultimate Guide to Remote Team Collaboration Tools\"\n Abstract: With the rise of remote work, choosing the right collaboration tools can be crucial to a team's success and productivity. This blog aims to be an comprehensive guide, outlining the various types of tools available, from communication platforms like Slack to project management software and online collaboration tools. It will offer best practices and guidelines for selecting and utilizing these tools, ensuring remote teams can work seamlessly together and maximize their output.\n\n2. Blog title: \"Remote Team Management: Tips for Leading a Successful Virtual Workforce\"\n Abstract: Managing a remote team comes with its own set of challenges. This blog will provide an in-depth exploration of best practices for leading and motivating virtual teams. Covering topics such as effective communication strategies, performance evaluation, and maintaining a cohesive team culture, it will offer practical tips for managers and leaders to ensure their remote teams are engaged, productive, and well-managed.\n\n3. Blog title: \"The Fun Side of Remote Work: Games, Icebreakers, and Team Building Activities\"\n Abstract: Remote work can be isolating, and this blog aims to provide some fun and creative solutions. It will offer a comprehensive guide to the best online games, icebreaker questions, and virtual team building activities that remote teams can use to connect and bond. 
From virtual escape rooms to interactive games and thought-provoking icebreakers, these ideas will help enhance team spirit, foster collaboration, and create a enjoyable remote work experience.\n\n--------------------------------------------------\nTopic Name: Remote Team Fun\n\nTop Keywords: online games for remote teams, team building activities for remote teams, games for remote teams, retrospective ideas for remote teams, team building ideas for remote teams, fun retrospective ideas for remote teams, retro ideas for remote teams, team building exercises for remote teams, trust building exercises for remote teams, activities for remote teams\n\nBlog Post Ideas: 1. Blog title: \"The Great Remote Retro: Fun Games and Activities for Your Team\"\n Abstract: Remote work can make team building challenging. This blog post will be a fun guide to hosting interactive retro games and activities that bring your remote team together. From online escape rooms to virtual scavenger hunts, we'll explore the best ways to engage and unite your team, fostering collaboration and camaraderie. Virtual icebreakers and retrospective ideas will also be included to make your remote meetings more interactive and enjoyable.\n\n2. Blog title: \"Trust Falls: Building Trust Among Remote Teams\"\n Abstract: Trust is the foundation of every successful team, but how do you build it when everyone is scattered across different locations? This blog will focus on trust-building exercises and activities designed specifically for remote teams. From virtual trust falls to transparent communication practices, we'll discover innovative ways to strengthen team bonds and foster a culture of trust. You'll learn how to create an environment where your remote team can thrive and collaborate effectively.\n\n3. Blog title: \"Game Night for Remote Teams: A Guide to Online Games and Activities\"\n Abstract: Miss the old office game nights? This blog will bring the fun back with a guide to hosting online game nights and activities that are perfect for remote teams. From trivia games to virtual board games and even remote-friendly outdoor adventures, we'll keep your team engaged and entertained. With tips on setting up online tournaments and ideas for encouraging participation, your virtual game nights will be the highlight of your team's week. Keep your remote team spirit high!\n\n--------------------------------------------------\n```", "html": "", "htmlmode": false, "fullscreen": false, diff --git a/scripts/cookbooks-json/hello-world-meet-ai.json b/scripts/cookbooks-json/hello-world-meet-ai.json index 40653ea5..d9b3d951 100644 --- a/scripts/cookbooks-json/hello-world-meet-ai.json +++ b/scripts/cookbooks-json/hello-world-meet-ai.json @@ -13,7 +13,7 @@ }, "title": "Hello World! Meet Language AI", "slug": "hello-world-meet-ai", - "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Hello World! Meet Language AI

\\n
\\n\\n\"\n}\n[/block]\n\n\nHere we take a quick tour of what’s possible with language AI via Cohere’s Large Language Model (LLM) API. This is the Hello, World! of language AI, written for developers with little or no background in AI. In fact, we’ll do that by exploring the Hello, World! phrase itself.\n\nRead the accompanying [blog post here](https://txt.cohere.ai/hello-world-p1/).\n\n[block:html]\n{\n \"html\": \"\\\"Hello\"\n}\n[/block]\n\n\nWe’ll cover three groups of tasks that you will typically work on when dealing with language data, including:\n\n- Generating text\n- Classifying text\n- Analyzing text\n\nThe first step is to install the Cohere Python SDK. Next, create an API key, which you can generate from the Cohere [dashboard](https://os.cohere.ai/register) or [CLI tool](https://docs.cohere.ai/cli-key).\n\n```python\n! pip install cohere altair umap-learn -q\n```\n\n```python\nimport cohere\nimport pandas as pd\nimport numpy as np\nimport altair as alt\n\nco = cohere.Client(\"COHERE_API_KEY\") # Get your API key: https://dashboard.cohere.com/api-keys\n```\n\nThe Cohere Generate endpoint generates text given an input, called “prompt”. The prompt provides a context of what we want the model to generate text. To illustrate this, let’s start with a simple prompt as the input. \n\n### Try a Simple Prompt\n\n```python\nprompt = \"What is a Hello World program.\"\n\nresponse = co.chat(\n message=prompt,\n model='command-r')\n\nprint(response.text)\n```\n\n````\nA \"Hello World\" program is a traditional and simple program that is often used as an introduction to a new programming language. The program typically displays the message \"Hello World\" as its output. The concept of a \"Hello World\" program originated from the book *The C Programming Language* written by Kernighan and Ritchie, where the example program in the book displayed the message using the C programming language. \n\nThe \"Hello World\" program serves as a basic and straightforward way to verify that your development environment is set up correctly and to familiarize yourself with the syntax and fundamentals of the programming language. It's a starting point for learning how to write and run programs in a new language.\n\nThe program's simplicity makes it accessible to programmers of all skill levels, and it's often one of the first programs beginners write when learning to code. The exact implementation of a \"Hello World\" program varies depending on the programming language being used, but the core idea remains the same—to display the \"Hello World\" message. \n\nHere's how a \"Hello World\" program can be written in a few select languages:\n1. **C**:\n```c\n#include \nint main() {\n printf(\"Hello World\\n\");\n return 0;\n}\n```\n2. **Python**:\n```python\nprint(\"Hello World\")\n```\n3. **Java**:\n```java\nclass HelloWorld {\n public static void main(String[] args) {\n System.out.println(\"Hello World\");\n }\n}\n```\n4. **JavaScript**:\n```javascript\nconsole.log(\"Hello World\");\n```\n5. **C#**:\n```csharp\nusing System;\n\nclass Program {\n static void Main() {\n Console.WriteLine(\"Hello World\");\n }\n}\n```\nThe \"Hello World\" program is a testament to the power of programming, as a simple and concise message can be displayed in numerous languages with just a few lines of code. It's an exciting first step into the world of software development!\n````\n\n### Create a Better Prompt\n\nThe output is not bad, but it can be better. 
We need to find a way to make the output tighter to how we want it to be, which is where we leverage _prompt engineering_.\n\n```python\nprompt = \"\"\"\nWrite the first paragraph of a blog post given a blog title.\n--\nBlog Title: Best Activities in Toronto\nFirst Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's\nlargest city, there's an ever-evolving set of activities to choose from. Whether you're looking to\nvisit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In\nthis blog post, I'll share some of my favorite recommendations\n--\nBlog Title: Mastering Dynamic Programming\nFirst Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,\nand when to apply this optimization technique. We'll break down bottom-up and top-down approaches to\nsolve dynamic programming problems.\n--\nBlog Title: Learning to Code with Hello, World!\nFirst Paragraph:\"\"\"\n\nresponse = co.chat(\n message=prompt,\n model='command-r')\n\nprint(response.text)\n```\n\n```\nStarting to code can be daunting, but it's actually simpler than you think! The famous first program, \"Hello, World!\" is a rite of passage for all coders, and an excellent starting point to begin your coding journey. This blog will guide you through the process of writing your very first line of code, and help you understand why learning to code is an exciting and valuable skill to have, covering the fundamentals and the broader implications of this seemingly simple phrase.\n```\n\n### Automating the Process\n\nIn real applications, you will likely need to produce these text generations on an ongoing basis, given different inputs. Let’s simulate that with our example.\n\n```python\ndef generate_text(topic):\n prompt = f\"\"\"\nWrite the first paragraph of a blog post given a blog title.\n--\nBlog Title: Best Activities in Toronto\nFirst Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's\nlargest city, there's an ever-evolving set of activities to choose from. Whether you're looking to\nvisit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In\nthis blog post, I'll share some of my favorite recommendations\n--\nBlog Title: Mastering Dynamic Programming\nFirst Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,\nand when to apply this optimization technique. We'll break down bottom-up and top-down approaches to\nsolve dynamic programming problems.\n--\nBlog Title: {topic}\nFirst Paragraph:\"\"\"\n # Generate text by calling the Chat endpoint\n response = co.chat(\n message=prompt,\n model='command-r')\n\n return response.text\n```\n\n```python\ntopics = [\"How to Grow in Your Career\",\n \"The Habits of Great Software Developers\",\n \"Ideas for a Relaxing Weekend\"]\n```\n\n```python\nparagraphs = []\n\nfor topic in topics:\n paragraphs.append(generate_text(topic))\n \nfor topic,para in zip(topics,paragraphs):\n print(f\"Topic: {topic}\")\n print(f\"First Paragraph: {para}\")\n print(\"-\"*10)\n```\n\n```\nTopic: How to Grow in Your Career\nFirst Paragraph: Advancing in your career can seem like a daunting task, especially if you're unsure of the path ahead. In this ever-changing professional landscape, there are numerous factors to consider. This blog aims to shed light on the strategies and skills that can help you navigate the complexities of career progression and unlock your full potential. 
Whether you're looking to secure a promotion or explore new opportunities, these insights will help you chart a course for your future. Let's embark on this journey of self-improvement and professional growth, equipping you with the tools to succeed in your career aspirations.\n----------\nTopic: The Habits of Great Software Developers\nFirst Paragraph: Great software developers are renowned for their ability to write robust code and create innovative applications, but what sets them apart from their peers? In this blog, we'll delve into the daily habits that contribute to their success. From their approach to coding challenges to the ways they stay organized, we'll explore the routines and practices that help them excel in the fast-paced world of software development. Understanding these habits can help you elevate your own skills and join the ranks of these industry leaders.\n----------\nTopic: Ideas for a Relaxing Weekend\nFirst Paragraph: Life can be stressful, and sometimes we just need a relaxing weekend to unwind and recharge. In this fast-paced world, taking some time to slow down and rejuvenate is essential. This blog post is here to help you plan the perfect low-key weekend with some easy and accessible ideas. From cozy indoor activities to peaceful outdoor adventures, I'll share some ideas to help you renew your mind, body, and spirit. Whether you're a homebody or an adventure seeker, there's something special for everyone. So, grab a cup of tea, sit back, and get ready to dive into a calming weekend of self-care and relaxation!\n----------\n```\n\nCohere’s Classify endpoint makes it easy to take a list of texts and predict their categories, or classes. A typical machine learning model requires many training examples to perform text classification, but with the Classify endpoint, you can get started with as few as 5 examples per class.\n\n### Sentiment Analysis\n\n```python\nfrom cohere import ClassifyExample\n\nexamples = [\n ClassifyExample(text=\"I’m so proud of you\", label=\"positive\"), \n ClassifyExample(text=\"What a great time to be alive\", label=\"positive\"), \n ClassifyExample(text=\"That’s awesome work\", label=\"positive\"), \n ClassifyExample(text=\"The service was amazing\", label=\"positive\"), \n ClassifyExample(text=\"I love my family\", label=\"positive\"), \n ClassifyExample(text=\"They don't care about me\", label=\"negative\"), \n ClassifyExample(text=\"I hate this place\", label=\"negative\"), \n ClassifyExample(text=\"The most ridiculous thing I've ever heard\", label=\"negative\"), \n ClassifyExample(text=\"I am really frustrated\", label=\"negative\"), \n ClassifyExample(text=\"This is so unfair\", label=\"negative\"),\n ClassifyExample(text=\"This made me think\", label=\"neutral\"), \n ClassifyExample(text=\"The good old days\", label=\"neutral\"), \n ClassifyExample(text=\"What's the difference\", label=\"neutral\"), \n ClassifyExample(text=\"You can't ignore this\", label=\"neutral\"), \n ClassifyExample(text=\"That's how I see it\", label=\"neutral\")\n]\n```\n\n```python\ninputs=[\"Hello, world! 
What a beautiful day\",\n \"It was a great time with great people\",\n \"Great place to work\",\n \"That was a wonderful evening\",\n \"Maybe this is why\",\n \"Let's start again\",\n \"That's how I see it\",\n \"These are all facts\",\n \"This is the worst thing\",\n \"I cannot stand this any longer\",\n \"This is really annoying\",\n \"I am just plain fed up\"\n ]\n```\n\n```python\ndef classify_text(inputs, examples):\n \"\"\"\n Classify a list of input texts\n Arguments:\n inputs(list[str]): a list of input texts to be classified\n examples(list[Example]): a list of example texts and class labels\n Returns:\n classifications(list): each result contains the text, labels, and conf values\n \"\"\"\n # Classify text by calling the Classify endpoint\n response = co.classify(\n model='embed-english-v2.0',\n inputs=inputs,\n examples=examples)\n \n classifications = response.classifications\n \n return classifications\n```\n\n```python\npredictions = classify_text(inputs,examples)\n\nclasses = [\"positive\",\"negative\",\"neutral\"]\nfor inp,pred in zip(inputs,predictions):\n class_pred = pred.predictions[0]\n class_idx = classes.index(class_pred)\n class_conf = pred.confidences[0]\n\n print(f\"Input: {inp}\")\n print(f\"Prediction: {class_pred}\")\n print(f\"Confidence: {class_conf:.2f}\")\n print(\"-\"*10)\n```\n\n```\nInput: Hello, world! What a beautiful day\nPrediction: positive\nConfidence: 0.84\n----------\nInput: It was a great time with great people\nPrediction: positive\nConfidence: 0.99\n----------\nInput: Great place to work\nPrediction: positive\nConfidence: 0.91\n----------\nInput: That was a wonderful evening\nPrediction: positive\nConfidence: 0.96\n----------\nInput: Maybe this is why\nPrediction: neutral\nConfidence: 0.70\n----------\nInput: Let's start again\nPrediction: neutral\nConfidence: 0.83\n----------\nInput: That's how I see it\nPrediction: neutral\nConfidence: 1.00\n----------\nInput: These are all facts\nPrediction: neutral\nConfidence: 0.78\n----------\nInput: This is the worst thing\nPrediction: negative\nConfidence: 0.93\n----------\nInput: I cannot stand this any longer\nPrediction: negative\nConfidence: 0.93\n----------\nInput: This is really annoying\nPrediction: negative\nConfidence: 0.99\n----------\nInput: I am just plain fed up\nPrediction: negative\nConfidence: 1.00\n----------\n```\n\nCohere’s Embed endpoint takes a piece of text and turns it into a vector embedding. Embeddings represent text in the form of numbers that capture its meaning and context. What it means is that it gives you the ability to turn unstructured text data into a structured form. It opens up ways to analyze and extract insights from them.\n\n## Get embeddings\n\nHere we have a list of 50 top web search keywords about Hello, World! taken from a keyword tool. Let’s look at a few examples:\n\n```python\ndf = pd.read_csv(\"https://github.com/cohere-ai/notebooks/raw/main/notebooks/data/hello-world-kw.csv\", names=[\"search_term\"])\ndf.head()\n```\n\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
search_term
0how to print hello world in python
1what is hello world
2how do you write hello world in an alert box
3how to print hello world in java
4how to write hello world in eclipse
\n
\n\nWe use the Embed endpoint to get the embeddings for each of these keywords.\n\n```python\ndef embed_text(texts, input_type):\n \"\"\"\n Turns a piece of text into embeddings\n Arguments:\n text(str): the text to be turned into embeddings\n Returns:\n embedding(list): the embeddings\n \"\"\"\n # Embed text by calling the Embed endpoint\n response = co.embed(\n model=\"embed-english-v3.0\",\n input_type=input_type,\n texts=texts)\n \n return response.embeddings\n```\n\n```python\ndf[\"search_term_embeds\"] = embed_text(texts=df[\"search_term\"].tolist(),\n input_type=\"search_document\")\ndoc_embeds = np.array(df[\"search_term_embeds\"].tolist())\n```\n\n### Semantic Search\n\nWe’ll look at a couple of example applications. The first example is semantic search. Given a new query, our \"search engine\" must return the most similar FAQs, where the FAQs are the 50 search terms we uploaded earlier.\n\n```python\nquery = \"what is the history of hello world\"\n\nquery_embeds = embed_text(texts=[query],\n input_type=\"search_query\")[0]\n```\n\nWe use cosine similarity to compare the similarity of the new query with each of the FAQs\n\n```python\n\nfrom sklearn.metrics.pairwise import cosine_similarity\n\ndef get_similarity(target, candidates):\n \"\"\"\n Computes the similarity between a target text and a list of other texts\n Arguments:\n target(list[float]): the target text\n candidates(list[list[float]]): a list of other texts, or candidates\n Returns:\n sim(list[tuple]): candidate IDs and the similarity scores\n \"\"\"\n # Turn list into array\n candidates = np.array(candidates)\n target = np.expand_dims(np.array(target),axis=0)\n\n # Calculate cosine similarity\n sim = cosine_similarity(target,candidates)\n sim = np.squeeze(sim).tolist()\n\n # Sort by descending order in similarity\n sim = list(enumerate(sim))\n sim = sorted(sim, key=lambda x:x[1], reverse=True)\n\n # Return similarity scores\n return sim\n```\n\nFinally, we display the top 5 FAQs that match the new query\n\n```python\nsimilarity = get_similarity(query_embeds,doc_embeds)\n\nprint(\"New query:\")\nprint(new_query,'\\n')\n\nprint(\"Similar queries:\")\nfor idx,score in similarity[:5]:\n print(f\"Similarity: {score:.2f};\", df.iloc[idx][\"search_term\"])\n```\n\n```\nNew query:\nwhat is the history of hello world \n\nSimilar queries:\nSimilarity: 0.58; how did hello world originate\nSimilarity: 0.56; where did hello world come from\nSimilarity: 0.54; why hello world\nSimilarity: 0.53; why is hello world so famous\nSimilarity: 0.53; what is hello world\n```\n\n### Semantic Exploration\n\nIn the second example, we take the same idea as semantic search and take a broader look, which is exploring huge volumes of text and analyzing their semantic relationships.\n\nWe'll use the same 50 top web search terms about Hello, World! There are different techniques we can use to compress the embeddings down to just 2 dimensions while retaining as much information as possible. We'll use a technique called UMAP. 
And once we can get it down to 2 dimensions, we can plot these embeddings on a 2D chart.\n\n```python\nimport umap\nreducer = umap.UMAP(n_neighbors=49) \numap_embeds = reducer.fit_transform(doc_embeds)\n\ndf['x'] = umap_embeds[:,0]\ndf['y'] = umap_embeds[:,1]\n```\n\n```python\nchart = alt.Chart(df).mark_circle(size=500).encode(\n x=\n alt.X('x',\n scale=alt.Scale(zero=False),\n axis=alt.Axis(labels=False, ticks=False, domain=False)\n ),\n\n y=\n alt.Y('y',\n scale=alt.Scale(zero=False),\n axis=alt.Axis(labels=False, ticks=False, domain=False)\n ),\n \n tooltip=['search_term']\n )\n\ntext = chart.mark_text(align='left', dx=15, size=12, color='black'\n ).encode(text='search_term', color= alt.value('black'))\n\nresult = (chart + text).configure(background=\"#FDF7F0\"\n ).properties(\n width=1000,\n height=700,\n title=\"2D Embeddings\"\n )\n\nresult\n```\n\n\n\n
", + "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Hello World! Meet Language AI

\\n
\\n\\n\"\n}\n[/block]\n\n\nHere we take a quick tour of what’s possible with language AI via Cohere’s Large Language Model (LLM) API. This is the Hello, World! of language AI, written for developers with little or no background in AI. In fact, we’ll do that by exploring the Hello, World! phrase itself.\n\nRead the accompanying [blog post here](https://cohere.com/blog/hello-world-p1/).\n\n[block:html]\n{\n \"html\": \"\\\"Hello\"\n}\n[/block]\n\n\nWe’ll cover three groups of tasks that you will typically work on when dealing with language data, including:\n\n- Generating text\n- Classifying text\n- Analyzing text\n\nThe first step is to install the Cohere Python SDK. Next, create an API key, which you can generate from the Cohere [dashboard](https://os.cohere.ai/register) or [CLI tool](https://docs.cohere.ai/cli-key).\n\n```python\n! pip install cohere altair umap-learn -q\n```\n\n```python\nimport cohere\nimport pandas as pd\nimport numpy as np\nimport altair as alt\n\nco = cohere.Client(\"COHERE_API_KEY\") # Get your API key: https://dashboard.cohere.com/api-keys\n```\n\nThe Cohere Generate endpoint generates text given an input, called “prompt”. The prompt provides a context of what we want the model to generate text. To illustrate this, let’s start with a simple prompt as the input. \n\n### Try a Simple Prompt\n\n```python\nprompt = \"What is a Hello World program.\"\n\nresponse = co.chat(\n message=prompt,\n model='command-r')\n\nprint(response.text)\n```\n\n````\nA \"Hello World\" program is a traditional and simple program that is often used as an introduction to a new programming language. The program typically displays the message \"Hello World\" as its output. The concept of a \"Hello World\" program originated from the book *The C Programming Language* written by Kernighan and Ritchie, where the example program in the book displayed the message using the C programming language. \n\nThe \"Hello World\" program serves as a basic and straightforward way to verify that your development environment is set up correctly and to familiarize yourself with the syntax and fundamentals of the programming language. It's a starting point for learning how to write and run programs in a new language.\n\nThe program's simplicity makes it accessible to programmers of all skill levels, and it's often one of the first programs beginners write when learning to code. The exact implementation of a \"Hello World\" program varies depending on the programming language being used, but the core idea remains the same—to display the \"Hello World\" message. \n\nHere's how a \"Hello World\" program can be written in a few select languages:\n1. **C**:\n```c\n#include \nint main() {\n printf(\"Hello World\\n\");\n return 0;\n}\n```\n2. **Python**:\n```python\nprint(\"Hello World\")\n```\n3. **Java**:\n```java\nclass HelloWorld {\n public static void main(String[] args) {\n System.out.println(\"Hello World\");\n }\n}\n```\n4. **JavaScript**:\n```javascript\nconsole.log(\"Hello World\");\n```\n5. **C#**:\n```csharp\nusing System;\n\nclass Program {\n static void Main() {\n Console.WriteLine(\"Hello World\");\n }\n}\n```\nThe \"Hello World\" program is a testament to the power of programming, as a simple and concise message can be displayed in numerous languages with just a few lines of code. It's an exciting first step into the world of software development!\n````\n\n### Create a Better Prompt\n\nThe output is not bad, but it can be better. 
We need a way to make the output closer to what we want, which is where _prompt engineering_ comes in.\n\n```python\nprompt = \"\"\"\nWrite the first paragraph of a blog post given a blog title.\n--\nBlog Title: Best Activities in Toronto\nFirst Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's\nlargest city, there's an ever-evolving set of activities to choose from. Whether you're looking to\nvisit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In\nthis blog post, I'll share some of my favorite recommendations\n--\nBlog Title: Mastering Dynamic Programming\nFirst Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,\nand when to apply this optimization technique. We'll break down bottom-up and top-down approaches to\nsolve dynamic programming problems.\n--\nBlog Title: Learning to Code with Hello, World!\nFirst Paragraph:\"\"\"\n\nresponse = co.chat(\n    message=prompt,\n    model='command-r')\n\nprint(response.text)\n```\n\n```\nStarting to code can be daunting, but it's actually simpler than you think! The famous first program, \"Hello, World!\" is a rite of passage for all coders, and an excellent starting point to begin your coding journey. This blog will guide you through the process of writing your very first line of code, and help you understand why learning to code is an exciting and valuable skill to have, covering the fundamentals and the broader implications of this seemingly simple phrase.\n```\n\n### Automating the Process\n\nIn real applications, you will likely need to produce these text generations on an ongoing basis, given different inputs. Let’s simulate that with our example.\n\n```python\ndef generate_text(topic):\n    prompt = f\"\"\"\nWrite the first paragraph of a blog post given a blog title.\n--\nBlog Title: Best Activities in Toronto\nFirst Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's\nlargest city, there's an ever-evolving set of activities to choose from. Whether you're looking to\nvisit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In\nthis blog post, I'll share some of my favorite recommendations\n--\nBlog Title: Mastering Dynamic Programming\nFirst Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,\nand when to apply this optimization technique. We'll break down bottom-up and top-down approaches to\nsolve dynamic programming problems.\n--\nBlog Title: {topic}\nFirst Paragraph:\"\"\"\n    # Generate text by calling the Chat endpoint\n    response = co.chat(\n        message=prompt,\n        model='command-r')\n\n    return response.text\n```\n\n```python\ntopics = [\"How to Grow in Your Career\",\n          \"The Habits of Great Software Developers\",\n          \"Ideas for a Relaxing Weekend\"]\n```\n\n```python\nparagraphs = []\n\nfor topic in topics:\n    paragraphs.append(generate_text(topic))\n\nfor topic,para in zip(topics,paragraphs):\n    print(f\"Topic: {topic}\")\n    print(f\"First Paragraph: {para}\")\n    print(\"-\"*10)\n```\n\n```\nTopic: How to Grow in Your Career\nFirst Paragraph: Advancing in your career can seem like a daunting task, especially if you're unsure of the path ahead. In this ever-changing professional landscape, there are numerous factors to consider. This blog aims to shed light on the strategies and skills that can help you navigate the complexities of career progression and unlock your full potential.
Whether you're looking to secure a promotion or explore new opportunities, these insights will help you chart a course for your future. Let's embark on this journey of self-improvement and professional growth, equipping you with the tools to succeed in your career aspirations.\n----------\nTopic: The Habits of Great Software Developers\nFirst Paragraph: Great software developers are renowned for their ability to write robust code and create innovative applications, but what sets them apart from their peers? In this blog, we'll delve into the daily habits that contribute to their success. From their approach to coding challenges to the ways they stay organized, we'll explore the routines and practices that help them excel in the fast-paced world of software development. Understanding these habits can help you elevate your own skills and join the ranks of these industry leaders.\n----------\nTopic: Ideas for a Relaxing Weekend\nFirst Paragraph: Life can be stressful, and sometimes we just need a relaxing weekend to unwind and recharge. In this fast-paced world, taking some time to slow down and rejuvenate is essential. This blog post is here to help you plan the perfect low-key weekend with some easy and accessible ideas. From cozy indoor activities to peaceful outdoor adventures, I'll share some ideas to help you renew your mind, body, and spirit. Whether you're a homebody or an adventure seeker, there's something special for everyone. So, grab a cup of tea, sit back, and get ready to dive into a calming weekend of self-care and relaxation!\n----------\n```\n\nCohere’s Classify endpoint makes it easy to take a list of texts and predict their categories, or classes. A typical machine learning model requires many training examples to perform text classification, but with the Classify endpoint, you can get started with as few as 5 examples per class.\n\n### Sentiment Analysis\n\n```python\nfrom cohere import ClassifyExample\n\nexamples = [\n ClassifyExample(text=\"I’m so proud of you\", label=\"positive\"), \n ClassifyExample(text=\"What a great time to be alive\", label=\"positive\"), \n ClassifyExample(text=\"That’s awesome work\", label=\"positive\"), \n ClassifyExample(text=\"The service was amazing\", label=\"positive\"), \n ClassifyExample(text=\"I love my family\", label=\"positive\"), \n ClassifyExample(text=\"They don't care about me\", label=\"negative\"), \n ClassifyExample(text=\"I hate this place\", label=\"negative\"), \n ClassifyExample(text=\"The most ridiculous thing I've ever heard\", label=\"negative\"), \n ClassifyExample(text=\"I am really frustrated\", label=\"negative\"), \n ClassifyExample(text=\"This is so unfair\", label=\"negative\"),\n ClassifyExample(text=\"This made me think\", label=\"neutral\"), \n ClassifyExample(text=\"The good old days\", label=\"neutral\"), \n ClassifyExample(text=\"What's the difference\", label=\"neutral\"), \n ClassifyExample(text=\"You can't ignore this\", label=\"neutral\"), \n ClassifyExample(text=\"That's how I see it\", label=\"neutral\")\n]\n```\n\n```python\ninputs=[\"Hello, world! 
What a beautiful day\",\n \"It was a great time with great people\",\n \"Great place to work\",\n \"That was a wonderful evening\",\n \"Maybe this is why\",\n \"Let's start again\",\n \"That's how I see it\",\n \"These are all facts\",\n \"This is the worst thing\",\n \"I cannot stand this any longer\",\n \"This is really annoying\",\n \"I am just plain fed up\"\n ]\n```\n\n```python\ndef classify_text(inputs, examples):\n \"\"\"\n Classify a list of input texts\n Arguments:\n inputs(list[str]): a list of input texts to be classified\n examples(list[Example]): a list of example texts and class labels\n Returns:\n classifications(list): each result contains the text, labels, and conf values\n \"\"\"\n # Classify text by calling the Classify endpoint\n response = co.classify(\n model='embed-english-v2.0',\n inputs=inputs,\n examples=examples)\n \n classifications = response.classifications\n \n return classifications\n```\n\n```python\npredictions = classify_text(inputs,examples)\n\nclasses = [\"positive\",\"negative\",\"neutral\"]\nfor inp,pred in zip(inputs,predictions):\n class_pred = pred.predictions[0]\n class_idx = classes.index(class_pred)\n class_conf = pred.confidences[0]\n\n print(f\"Input: {inp}\")\n print(f\"Prediction: {class_pred}\")\n print(f\"Confidence: {class_conf:.2f}\")\n print(\"-\"*10)\n```\n\n```\nInput: Hello, world! What a beautiful day\nPrediction: positive\nConfidence: 0.84\n----------\nInput: It was a great time with great people\nPrediction: positive\nConfidence: 0.99\n----------\nInput: Great place to work\nPrediction: positive\nConfidence: 0.91\n----------\nInput: That was a wonderful evening\nPrediction: positive\nConfidence: 0.96\n----------\nInput: Maybe this is why\nPrediction: neutral\nConfidence: 0.70\n----------\nInput: Let's start again\nPrediction: neutral\nConfidence: 0.83\n----------\nInput: That's how I see it\nPrediction: neutral\nConfidence: 1.00\n----------\nInput: These are all facts\nPrediction: neutral\nConfidence: 0.78\n----------\nInput: This is the worst thing\nPrediction: negative\nConfidence: 0.93\n----------\nInput: I cannot stand this any longer\nPrediction: negative\nConfidence: 0.93\n----------\nInput: This is really annoying\nPrediction: negative\nConfidence: 0.99\n----------\nInput: I am just plain fed up\nPrediction: negative\nConfidence: 1.00\n----------\n```\n\nCohere’s Embed endpoint takes a piece of text and turns it into a vector embedding. Embeddings represent text in the form of numbers that capture its meaning and context. What it means is that it gives you the ability to turn unstructured text data into a structured form. It opens up ways to analyze and extract insights from them.\n\n## Get embeddings\n\nHere we have a list of 50 top web search keywords about Hello, World! taken from a keyword tool. Let’s look at a few examples:\n\n```python\ndf = pd.read_csv(\"https://github.com/cohere-ai/notebooks/raw/main/notebooks/data/hello-world-kw.csv\", names=[\"search_term\"])\ndf.head()\n```\n\n
\n|   | search_term |\n|---|---|\n| 0 | how to print hello world in python |\n| 1 | what is hello world |\n| 2 | how do you write hello world in an alert box |\n| 3 | how to print hello world in java |\n| 4 | how to write hello world in eclipse |\n\nWe use the Embed endpoint to get the embeddings for each of these keywords.\n\n```python\ndef embed_text(texts, input_type):\n    \"\"\"\n    Turns a list of texts into embeddings\n    Arguments:\n        texts(list[str]): the texts to be turned into embeddings\n        input_type(str): the Embed input type, e.g. \"search_document\" or \"search_query\"\n    Returns:\n        embeddings(list): the embeddings\n    \"\"\"\n    # Embed text by calling the Embed endpoint\n    response = co.embed(\n        model=\"embed-english-v3.0\",\n        input_type=input_type,\n        texts=texts)\n\n    return response.embeddings\n```\n\n```python\ndf[\"search_term_embeds\"] = embed_text(texts=df[\"search_term\"].tolist(),\n                                      input_type=\"search_document\")\ndoc_embeds = np.array(df[\"search_term_embeds\"].tolist())\n```\n\n### Semantic Search\n\nWe’ll look at a couple of example applications. The first example is semantic search. Given a new query, our \"search engine\" must return the most similar FAQs, where the FAQs are the 50 search terms we uploaded earlier.\n\n```python\nquery = \"what is the history of hello world\"\n\nquery_embeds = embed_text(texts=[query],\n                          input_type=\"search_query\")[0]\n```\n\nWe use cosine similarity to compare the new query with each of the FAQs.\n\n```python\nfrom sklearn.metrics.pairwise import cosine_similarity\n\ndef get_similarity(target, candidates):\n    \"\"\"\n    Computes the similarity between a target embedding and a list of candidate embeddings\n    Arguments:\n        target(list[float]): the embedding of the target text\n        candidates(list[list[float]]): the embeddings of the candidate texts\n    Returns:\n        sim(list[tuple]): candidate IDs and similarity scores, sorted by score\n    \"\"\"\n    # Turn lists into arrays\n    candidates = np.array(candidates)\n    target = np.expand_dims(np.array(target),axis=0)\n\n    # Calculate cosine similarity\n    sim = cosine_similarity(target,candidates)\n    sim = np.squeeze(sim).tolist()\n\n    # Sort by descending order in similarity\n    sim = list(enumerate(sim))\n    sim = sorted(sim, key=lambda x:x[1], reverse=True)\n\n    # Return similarity scores\n    return sim\n```\n\nFinally, we display the top 5 FAQs that match the new query.\n\n```python\nsimilarity = get_similarity(query_embeds,doc_embeds)\n\nprint(\"New query:\")\nprint(query,'\\n')\n\nprint(\"Similar queries:\")\nfor idx,score in similarity[:5]:\n    print(f\"Similarity: {score:.2f};\", df.iloc[idx][\"search_term\"])\n```\n\n```\nNew query:\nwhat is the history of hello world \n\nSimilar queries:\nSimilarity: 0.58; how did hello world originate\nSimilarity: 0.56; where did hello world come from\nSimilarity: 0.54; why hello world\nSimilarity: 0.53; why is hello world so famous\nSimilarity: 0.53; what is hello world\n```\n\n### Semantic Exploration\n\nIn the second example, we build on the idea of semantic search and take a broader look: exploring huge volumes of text and analyzing their semantic relationships.\n\nWe'll use the same 50 top web search terms about Hello, World! There are different techniques we can use to compress the embeddings down to just 2 dimensions while retaining as much information as possible. We'll use a technique called UMAP.
Once we get the embeddings down to 2 dimensions, we can plot them on a 2D chart.\n\n```python\nimport umap\n\nreducer = umap.UMAP(n_neighbors=49)\numap_embeds = reducer.fit_transform(doc_embeds)\n\ndf['x'] = umap_embeds[:,0]\ndf['y'] = umap_embeds[:,1]\n```\n\n```python\nchart = alt.Chart(df).mark_circle(size=500).encode(\n    x=alt.X('x',\n            scale=alt.Scale(zero=False),\n            axis=alt.Axis(labels=False, ticks=False, domain=False)),\n    y=alt.Y('y',\n            scale=alt.Scale(zero=False),\n            axis=alt.Axis(labels=False, ticks=False, domain=False)),\n    tooltip=['search_term'])\n\ntext = chart.mark_text(align='left', dx=15, size=12, color='black'\n                       ).encode(text='search_term', color=alt.value('black'))\n\nresult = (chart + text).configure(background=\"#FDF7F0\"\n                                  ).properties(width=1000,\n                                               height=700,\n                                               title=\"2D Embeddings\")\n\nresult\n```\n
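\nAs a possible extension (this sketch is our addition, not part of the original notebook), the same embeddings can be grouped into topics with a standard clustering algorithm such as scikit-learn's KMeans; the number of clusters below is an arbitrary choice:\n\n```python\nfrom sklearn.cluster import KMeans\n\n# Cluster the 50 keyword embeddings (k=4 is an arbitrary choice)\nkmeans = KMeans(n_clusters=4, random_state=0, n_init=10)\ndf[\"cluster\"] = kmeans.fit_predict(doc_embeds)\n\n# Show a few keywords from each cluster\nfor cluster_id in sorted(df[\"cluster\"].unique()):\n    print(f\"Cluster {cluster_id}:\")\n    for term in df[df[\"cluster\"] == cluster_id][\"search_term\"].head(3):\n        print(f\"  - {term}\")\n```\n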
", "html": "", "htmlmode": false, "fullscreen": false, diff --git a/scripts/cookbooks-json/multilingual-search.json b/scripts/cookbooks-json/multilingual-search.json index 806fc472..25da2d8c 100644 --- a/scripts/cookbooks-json/multilingual-search.json +++ b/scripts/cookbooks-json/multilingual-search.json @@ -13,7 +13,7 @@ }, "title": "Multilingual Search with Cohere and Langchain", "slug": "multilingual-search", - "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Multilingual Search with Cohere and Langchain

\\n
\\n\\n\"\n}\n[/block]\n\n\n***Read the accompanying [blog post here](https://txt.cohere.ai/search-cohere-langchain/).***\n\nThis notebook contains two examples for performing multilingual search using Cohere and Langchain. Langchain is a library that assists the development of applications built on top of large language models (LLMs), such as Cohere's models.\n\nIn short, Cohere makes it easy for developers to leverage LLMs and Langchain makes it easy to build applications with these models.\n\nWe'll go through the following examples:\n- **Example 1 - Basic Multilingual Search**\n\n This is a simple example of multilingual search over a list of documents.\n\n The steps in summary:\n - Import a list of documents\n - Embed the documents and store them in an index\n - Enter a query\n - Return the document most similar to the query\n- **Example 2 - Search-Based Question Answering**\n\n This example shows a more involved example where search is combined with text generation to answer questions about long-form documents.\n\n The steps in summary:\n - Add an article and chunk it into smaller passages\n - Embed the passages and store them in an index\n - Enter a question\n - Answer the question based on the most relevant documents\n\n\n```python\nfrom langchain.embeddings.cohere import CohereEmbeddings\nfrom langchain.llms import Cohere\nfrom langchain.prompts import PromptTemplate\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom langchain.chains.question_answering import load_qa_chain\nfrom langchain.chains import RetrievalQA\nfrom langchain.vectorstores import Qdrant\nfrom langchain.document_loaders import TextLoader\nimport textwrap as tr\nimport random\nimport dotenv\nimport os\n\ndotenv.load_dotenv(\".env\") # Upload an '.env' file containing an environment variable named 'COHERE_API_KEY' using your Cohere API Key\n```\n\n\n\n\n True\n\n\n\n\n[block:html]{\"html\":\"\\\"Example-1---Basic-Multilingual-Search.png\\\"/\"}[/block]\n\n### Import a list of documents\n\n\n```python\nimport tensorflow_datasets as tfds\ndataset = tfds.load('trec', split='train')\ntexts = [item['text'].decode('utf-8') for item in tfds.as_numpy(dataset)]\nprint(f\"Number of documents: {len(texts)}\")\n```\n\n Downloading and preparing dataset 350.79 KiB (download: 350.79 KiB, generated: 636.90 KiB, total: 987.69 KiB) to /root/tensorflow_datasets/trec/1.0.0...\n\n\n\n Dl Completed...: 0 url [00:00, ? url/s]\n\n\n\n Dl Size...: 0 MiB [00:00, ? MiB/s]\n\n\n\n Extraction completed...: 0 file [00:00, ? file/s]\n\n\n\n Generating splits...: 0%| | 0/2 [00:00\"}[/block]\n\n## Add an article and chunk it into smaller passages\n\n\n```python\n\n!wget 'https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F' -O steve-jobs-commencement.txt\n```\n\n --2023-06-08 06:11:19-- https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F\n Resolving docs.google.com (docs.google.com)... 74.125.200.101, 74.125.200.138, 74.125.200.102, ...\n Connecting to docs.google.com (docs.google.com)|74.125.200.101|:443... connected.\n HTTP request sent, awaiting response... 
303 See Other\n Location: https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc [following]\n Warning: wildcards not supported in HTTP.\n --2023-06-08 06:11:19-- https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc\n Resolving doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)... 74.125.130.132, 2404:6800:4003:c01::84\n Connecting to doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)|74.125.130.132|:443... connected.\n HTTP request sent, awaiting response... 200 OK\n Length: 11993 (12K) [text/plain]\n Saving to: ‘steve-jobs-commencement.txt’\n \n steve-jobs-commence 100%[===================>] 11.71K --.-KB/s in 0s \n \n 2023-06-08 06:11:20 (115 MB/s) - ‘steve-jobs-commencement.txt’ saved [11993/11993]\n \n\n\n\n```python\nloader = TextLoader(\"steve-jobs-commencement.txt\")\ndocuments = loader.load()\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\ntexts = text_splitter.split_documents(documents)\n```\n\n## Embed the passages and store them in an index\n\n\n```python\nembeddings = CohereEmbeddings(model = \"multilingual-22-12\")\ndb = Qdrant.from_documents(texts, embeddings, location=\":memory:\", collection_name=\"my_documents\", distance_func=\"Dot\")\n```\n\n## Enter a question\n\n\n```python\nquestions = [\n \"What did the author liken The Whole Earth Catalog to?\",\n \"What was Reed College great at?\",\n \"What was the author diagnosed with?\",\n \"What is the key lesson from this article?\",\n \"What did the article say about Michael Jackson?\",\n ]\n```\n\n## Answer the question based on the most relevant documents\n\n\n\n```python\n\nprompt_template = \"\"\"Text: {context}\n\nQuestion: {question}\n\nAnswer the question based on the text provided. 
If the text doesn't contain the answer, reply that the answer is not available.\"\"\"\n\nPROMPT = PromptTemplate(\n template=prompt_template, input_variables=[\"context\", \"question\"]\n)\n```\n\n\n```python\nchain_type_kwargs = {\"prompt\": PROMPT}\n\nqa = RetrievalQA.from_chain_type(llm=Cohere(model=\"command\", temperature=0), \n chain_type=\"stuff\", \n retriever=db.as_retriever(), \n chain_type_kwargs=chain_type_kwargs, \n return_source_documents=True)\n\nfor question in questions:\n answer = qa({\"query\": question})\n result = answer[\"result\"].replace(\"\\n\",\"\").replace(\"Answer:\",\"\")\n sources = answer['source_documents']\n print(\"-\"*150,\"\\n\")\n print(f\"Question: {question}\")\n print(f\"Answer: {result}\")\n\n ### COMMENT OUT THE 4 LINES BELOW TO HIDE THE SOURCES\n print(f\"\\nSources:\")\n for idx, source in enumerate(sources):\n source_wrapped = tr.fill(str(source.page_content), width=150)\n print(f\"{idx+1}: {source_wrapped}\")\n```\n\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What did the author liken The Whole Earth Catalog to?\n Answer: It was sort of like Google in paperback form, 35 years before Google came along\n \n Sources:\n 1: When I was young, there was an amazing publication called The Whole Earth Catalog, which was one of the bibles of my generation. It was created by a\n fellow named Stewart Brand not far from here in Menlo Park, and he brought it to life with his poetic touch. This was in the late 1960s, before\n personal computers and desktop publishing, so it was all made with typewriters, scissors and Polaroid cameras. It was sort of like Google in paperback\n form, 35 years before Google came along: It was\n 2: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the\n mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find\n yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they\n signed off. Stay Hungry. Stay Foolish. And I have always\n 3: idealistic, and overflowing with neat tools and great notions.\n 4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What was Reed College great at?\n Answer: Reed College was great at calligraphy instruction.\n \n Sources:\n 1: Reed College at that time offered perhaps the best calligraphy instruction in the country. Throughout the campus every poster, every label on every\n drawer, was beautifully hand calligraphed. Because I had dropped out and didn’t have to take the normal classes, I decided to take a calligraphy class\n to learn how to do this. I learned about serif and sans serif typefaces, about varying the amount of space between different letter combinations,\n about what makes great typography great. It was\n 2: I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. 
So why\n did I drop out?\n 3: never dropped out, I would have never dropped in on this calligraphy class, and personal computers might not have the wonderful typography that they\n do. Of course it was impossible to connect the dots looking forward when I was in college. But it was very, very clear looking backward 10 years\n later.\n 4: OK. It was pretty scary at the time, but looking back it was one of the best decisions I ever made. The minute I dropped out I could stop taking the\n required classes that didn’t interest me, and begin dropping in on the ones that looked interesting.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What was the author diagnosed with?\n Answer: The author was diagnosed with cancer.\n \n Sources:\n 1: I lived with that diagnosis all day. Later that evening I had a biopsy, where they stuck an endoscope down my throat, through my stomach and into my\n intestines, put a needle into my pancreas and got a few cells from the tumor. I was sedated, but my wife, who was there, told me that when they viewed\n the cells under a microscope the doctors started crying because it turned out to be a very rare form of pancreatic cancer that is curable with\n surgery. I had the surgery and I’m fine now.\n 2: About a year ago I was diagnosed with cancer. I had a scan at 7:30 in the morning, and it clearly showed a tumor on my pancreas. I didn’t even know\n what a pancreas was. The doctors told me this was almost certainly a type of cancer that is incurable, and that I should expect to live no longer than\n three to six months. My doctor advised me to go home and get my affairs in order, which is doctor’s code for prepare to die. It means to try to tell\n your kids everything you thought you’d have the\n 3: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the\n mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find\n yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they\n signed off. Stay Hungry. Stay Foolish. And I have always\n 4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What is the key lesson from this article?\n Answer: The key lesson from this article is that you have to trust that the dots will somehow connect in your future. You have to trust in something -- your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.\n \n Sources:\n 1: Again, you can’t connect the dots looking forward; you can only connect them looking backward. So you have to trust that the dots will somehow connect\n in your future. You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all\n the difference in my life. My second story is about love and loss.\n 2: Remembering that I’ll be dead soon is the most important tool I’ve ever encountered to help me make the big choices in life. 
Because almost everything\n — all external expectations, all pride, all fear of embarrassment or failure — these things just fall away in the face of death, leaving only what is\n truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are\n already naked. There is no reason not to follow your\n 3: Your time is limited, so don’t waste it living someone else’s life. Don’t be trapped by dogma — which is living with the results of other people’s\n thinking. Don’t let the noise of others’ opinions drown out your own inner voice. And most important, have the courage to follow your heart and\n intuition. They somehow already know what you truly want to become. Everything else is secondary.\n 4: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton\n as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and\n I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple\n had not changed that one bit. I had been rejected,\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What did the article say about Michael Jackson?\n Answer: The text did not provide information about Michael Jackson.\n \n Sources:\n 1: baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my\n father had never graduated from high school. She refused to sign the final adoption papers. She only relented a few months later when my parents\n promised that I would someday go to college.\n 2: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n 3: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton\n as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and\n I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple\n had not changed that one bit. I had been rejected,\n 4: This was the closest I’ve been to facing death, and I hope it’s the closest I get for a few more decades. 
Having lived through it, I can now say this\n to you with a bit more certainty than when death was a useful but purely intellectual concept:\n\n\n## Questions in French\n\n\n```python\nquestions_fr = [\n \"À quoi se compare The Whole Earth Catalog ?\",\n \"Dans quoi Reed College était-il excellent ?\",\n \"De quoi l'auteur a-t-il été diagnostiqué ?\",\n \"Quelle est la leçon clé de cet article ?\",\n \"Que disait l'article sur Michael Jackson ?\",\n ]\n```\n\n\n```python\n```\n\n\n```python\n\nchain_type_kwargs = {\"prompt\": PROMPT}\n\nqa = RetrievalQA.from_chain_type(llm=Cohere(model=\"command\", temperature=0), \n chain_type=\"stuff\", \n retriever=db.as_retriever(), \n chain_type_kwargs=chain_type_kwargs, \n return_source_documents=True)\n\nfor question in questions_fr:\n answer = qa({\"query\": question})\n result = answer[\"result\"].replace(\"\\n\",\"\").replace(\"Answer:\",\"\")\n sources = answer['source_documents']\n print(\"-\"*20,\"\\n\")\n print(f\"Question: {question}\")\n print(f\"Answer: {result}\")\n```\n\n -------------------- \n \n Question: À quoi se compare The Whole Earth Catalog ?\n Answer: The Whole Earth Catalog was like Google in paperback form, 35 years before Google came along.\n -------------------- \n \n Question: Dans quoi Reed College était-il excellent ?\n Answer: Reed College offered the best calligraphy instruction in the country.\n -------------------- \n \n Question: De quoi l'auteur a-t-il été diagnostiqué ?\n Answer: The author was diagnosed with a very rare form of pancreatic cancer that is curable with surgery.\n -------------------- \n \n Question: Quelle est la leçon clé de cet article ?\n Answer: The key lesson of this article is that remembering that you will die soon is the most important tool to help one make the big choices in life.\n -------------------- \n \n Question: Que disait l'article sur Michael Jackson ?\n Answer: The text does not contain the answer to the question.", + "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Multilingual Search with Cohere and Langchain

\\n
\\n\\n\"\n}\n[/block]\n\n\n***Read the accompanying [blog post here](https://cohere.com/blog/search-cohere-langchain/).***\n\nThis notebook contains two examples for performing multilingual search using Cohere and Langchain. Langchain is a library that assists the development of applications built on top of large language models (LLMs), such as Cohere's models.\n\nIn short, Cohere makes it easy for developers to leverage LLMs and Langchain makes it easy to build applications with these models.\n\nWe'll go through the following examples:\n- **Example 1 - Basic Multilingual Search**\n\n This is a simple example of multilingual search over a list of documents.\n\n The steps in summary:\n - Import a list of documents\n - Embed the documents and store them in an index\n - Enter a query\n - Return the document most similar to the query\n- **Example 2 - Search-Based Question Answering**\n\n This example shows a more involved example where search is combined with text generation to answer questions about long-form documents.\n\n The steps in summary:\n - Add an article and chunk it into smaller passages\n - Embed the passages and store them in an index\n - Enter a question\n - Answer the question based on the most relevant documents\n\n\n```python\nfrom langchain.embeddings.cohere import CohereEmbeddings\nfrom langchain.llms import Cohere\nfrom langchain.prompts import PromptTemplate\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter\nfrom langchain.chains.question_answering import load_qa_chain\nfrom langchain.chains import RetrievalQA\nfrom langchain.vectorstores import Qdrant\nfrom langchain.document_loaders import TextLoader\nimport textwrap as tr\nimport random\nimport dotenv\nimport os\n\ndotenv.load_dotenv(\".env\") # Upload an '.env' file containing an environment variable named 'COHERE_API_KEY' using your Cohere API Key\n```\n\n\n\n\n True\n\n\n\n\n[block:html]{\"html\":\"\\\"Example-1---Basic-Multilingual-Search.png\\\"/\"}[/block]\n\n### Import a list of documents\n\n\n```python\nimport tensorflow_datasets as tfds\ndataset = tfds.load('trec', split='train')\ntexts = [item['text'].decode('utf-8') for item in tfds.as_numpy(dataset)]\nprint(f\"Number of documents: {len(texts)}\")\n```\n\n Downloading and preparing dataset 350.79 KiB (download: 350.79 KiB, generated: 636.90 KiB, total: 987.69 KiB) to /root/tensorflow_datasets/trec/1.0.0...\n\n\n\n Dl Completed...: 0 url [00:00, ? url/s]\n\n\n\n Dl Size...: 0 MiB [00:00, ? MiB/s]\n\n\n\n Extraction completed...: 0 file [00:00, ? file/s]\n\n\n\n Generating splits...: 0%| | 0/2 [00:00\"}[/block]\n\n## Add an article and chunk it into smaller passages\n\n\n```python\n\n!wget 'https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F' -O steve-jobs-commencement.txt\n```\n\n --2023-06-08 06:11:19-- https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F\n Resolving docs.google.com (docs.google.com)... 74.125.200.101, 74.125.200.138, 74.125.200.102, ...\n Connecting to docs.google.com (docs.google.com)|74.125.200.101|:443... connected.\n HTTP request sent, awaiting response... 
303 See Other\n Location: https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc [following]\n Warning: wildcards not supported in HTTP.\n --2023-06-08 06:11:19-- https://doc-0g-84-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/84t4moii9dmg08hmrh6rfpp8ecrjh6jq/1686204675000/12721472133292131824/*/1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F?e=download&uuid=a26288c7-ad0c-4707-ae0b-72cb94c224dc\n Resolving doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)... 74.125.130.132, 2404:6800:4003:c01::84\n Connecting to doc-0g-84-docs.googleusercontent.com (doc-0g-84-docs.googleusercontent.com)|74.125.130.132|:443... connected.\n HTTP request sent, awaiting response... 200 OK\n Length: 11993 (12K) [text/plain]\n Saving to: ‘steve-jobs-commencement.txt’\n \n steve-jobs-commence 100%[===================>] 11.71K --.-KB/s in 0s \n \n 2023-06-08 06:11:20 (115 MB/s) - ‘steve-jobs-commencement.txt’ saved [11993/11993]\n \n\n\n\n```python\nloader = TextLoader(\"steve-jobs-commencement.txt\")\ndocuments = loader.load()\ntext_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)\ntexts = text_splitter.split_documents(documents)\n```\n\n## Embed the passages and store them in an index\n\n\n```python\nembeddings = CohereEmbeddings(model = \"multilingual-22-12\")\ndb = Qdrant.from_documents(texts, embeddings, location=\":memory:\", collection_name=\"my_documents\", distance_func=\"Dot\")\n```\n\n## Enter a question\n\n\n```python\nquestions = [\n \"What did the author liken The Whole Earth Catalog to?\",\n \"What was Reed College great at?\",\n \"What was the author diagnosed with?\",\n \"What is the key lesson from this article?\",\n \"What did the article say about Michael Jackson?\",\n ]\n```\n\n## Answer the question based on the most relevant documents\n\n\n\n```python\n\nprompt_template = \"\"\"Text: {context}\n\nQuestion: {question}\n\nAnswer the question based on the text provided. 
If the text doesn't contain the answer, reply that the answer is not available.\"\"\"\n\nPROMPT = PromptTemplate(\n template=prompt_template, input_variables=[\"context\", \"question\"]\n)\n```\n\n\n```python\nchain_type_kwargs = {\"prompt\": PROMPT}\n\nqa = RetrievalQA.from_chain_type(llm=Cohere(model=\"command\", temperature=0), \n chain_type=\"stuff\", \n retriever=db.as_retriever(), \n chain_type_kwargs=chain_type_kwargs, \n return_source_documents=True)\n\nfor question in questions:\n answer = qa({\"query\": question})\n result = answer[\"result\"].replace(\"\\n\",\"\").replace(\"Answer:\",\"\")\n sources = answer['source_documents']\n print(\"-\"*150,\"\\n\")\n print(f\"Question: {question}\")\n print(f\"Answer: {result}\")\n\n ### COMMENT OUT THE 4 LINES BELOW TO HIDE THE SOURCES\n print(f\"\\nSources:\")\n for idx, source in enumerate(sources):\n source_wrapped = tr.fill(str(source.page_content), width=150)\n print(f\"{idx+1}: {source_wrapped}\")\n```\n\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What did the author liken The Whole Earth Catalog to?\n Answer: It was sort of like Google in paperback form, 35 years before Google came along\n \n Sources:\n 1: When I was young, there was an amazing publication called The Whole Earth Catalog, which was one of the bibles of my generation. It was created by a\n fellow named Stewart Brand not far from here in Menlo Park, and he brought it to life with his poetic touch. This was in the late 1960s, before\n personal computers and desktop publishing, so it was all made with typewriters, scissors and Polaroid cameras. It was sort of like Google in paperback\n form, 35 years before Google came along: It was\n 2: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the\n mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find\n yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they\n signed off. Stay Hungry. Stay Foolish. And I have always\n 3: idealistic, and overflowing with neat tools and great notions.\n 4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What was Reed College great at?\n Answer: Reed College was great at calligraphy instruction.\n \n Sources:\n 1: Reed College at that time offered perhaps the best calligraphy instruction in the country. Throughout the campus every poster, every label on every\n drawer, was beautifully hand calligraphed. Because I had dropped out and didn’t have to take the normal classes, I decided to take a calligraphy class\n to learn how to do this. I learned about serif and sans serif typefaces, about varying the amount of space between different letter combinations,\n about what makes great typography great. It was\n 2: I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. 
So why\n did I drop out?\n 3: never dropped out, I would have never dropped in on this calligraphy class, and personal computers might not have the wonderful typography that they\n do. Of course it was impossible to connect the dots looking forward when I was in college. But it was very, very clear looking backward 10 years\n later.\n 4: OK. It was pretty scary at the time, but looking back it was one of the best decisions I ever made. The minute I dropped out I could stop taking the\n required classes that didn’t interest me, and begin dropping in on the ones that looked interesting.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What was the author diagnosed with?\n Answer: The author was diagnosed with cancer.\n \n Sources:\n 1: I lived with that diagnosis all day. Later that evening I had a biopsy, where they stuck an endoscope down my throat, through my stomach and into my\n intestines, put a needle into my pancreas and got a few cells from the tumor. I was sedated, but my wife, who was there, told me that when they viewed\n the cells under a microscope the doctors started crying because it turned out to be a very rare form of pancreatic cancer that is curable with\n surgery. I had the surgery and I’m fine now.\n 2: About a year ago I was diagnosed with cancer. I had a scan at 7:30 in the morning, and it clearly showed a tumor on my pancreas. I didn’t even know\n what a pancreas was. The doctors told me this was almost certainly a type of cancer that is incurable, and that I should expect to live no longer than\n three to six months. My doctor advised me to go home and get my affairs in order, which is doctor’s code for prepare to die. It means to try to tell\n your kids everything you thought you’d have the\n 3: Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the\n mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find\n yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they\n signed off. Stay Hungry. Stay Foolish. And I have always\n 4: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What is the key lesson from this article?\n Answer: The key lesson from this article is that you have to trust that the dots will somehow connect in your future. You have to trust in something -- your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.\n \n Sources:\n 1: Again, you can’t connect the dots looking forward; you can only connect them looking backward. So you have to trust that the dots will somehow connect\n in your future. You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all\n the difference in my life. My second story is about love and loss.\n 2: Remembering that I’ll be dead soon is the most important tool I’ve ever encountered to help me make the big choices in life. 
Because almost everything\n — all external expectations, all pride, all fear of embarrassment or failure — these things just fall away in the face of death, leaving only what is\n truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are\n already naked. There is no reason not to follow your\n 3: Your time is limited, so don’t waste it living someone else’s life. Don’t be trapped by dogma — which is living with the results of other people’s\n thinking. Don’t let the noise of others’ opinions drown out your own inner voice. And most important, have the courage to follow your heart and\n intuition. They somehow already know what you truly want to become. Everything else is secondary.\n 4: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton\n as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and\n I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple\n had not changed that one bit. I had been rejected,\n ------------------------------------------------------------------------------------------------------------------------------------------------------ \n \n Question: What did the article say about Michael Jackson?\n Answer: The text did not provide information about Michael Jackson.\n \n Sources:\n 1: baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my\n father had never graduated from high school. She refused to sign the final adoption papers. She only relented a few months later when my parents\n promised that I would someday go to college.\n 2: beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.\n 3: I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton\n as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and\n I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple\n had not changed that one bit. I had been rejected,\n 4: This was the closest I’ve been to facing death, and I hope it’s the closest I get for a few more decades. 
Having lived through it, I can now say this\n to you with a bit more certainty than when death was a useful but purely intellectual concept:\n\n\n## Questions in French\n\n\n```python\nquestions_fr = [\n \"À quoi se compare The Whole Earth Catalog ?\",\n \"Dans quoi Reed College était-il excellent ?\",\n \"De quoi l'auteur a-t-il été diagnostiqué ?\",\n \"Quelle est la leçon clé de cet article ?\",\n \"Que disait l'article sur Michael Jackson ?\",\n ]\n```\n\n\n```python\n```\n\n\n```python\n\nchain_type_kwargs = {\"prompt\": PROMPT}\n\nqa = RetrievalQA.from_chain_type(llm=Cohere(model=\"command\", temperature=0), \n chain_type=\"stuff\", \n retriever=db.as_retriever(), \n chain_type_kwargs=chain_type_kwargs, \n return_source_documents=True)\n\nfor question in questions_fr:\n answer = qa({\"query\": question})\n result = answer[\"result\"].replace(\"\\n\",\"\").replace(\"Answer:\",\"\")\n sources = answer['source_documents']\n print(\"-\"*20,\"\\n\")\n print(f\"Question: {question}\")\n print(f\"Answer: {result}\")\n```\n\n -------------------- \n \n Question: À quoi se compare The Whole Earth Catalog ?\n Answer: The Whole Earth Catalog was like Google in paperback form, 35 years before Google came along.\n -------------------- \n \n Question: Dans quoi Reed College était-il excellent ?\n Answer: Reed College offered the best calligraphy instruction in the country.\n -------------------- \n \n Question: De quoi l'auteur a-t-il été diagnostiqué ?\n Answer: The author was diagnosed with a very rare form of pancreatic cancer that is curable with surgery.\n -------------------- \n \n Question: Quelle est la leçon clé de cet article ?\n Answer: The key lesson of this article is that remembering that you will die soon is the most important tool to help one make the big choices in life.\n -------------------- \n \n Question: Que disait l'article sur Michael Jackson ?\n Answer: The text does not contain the answer to the question.", "html": "", "htmlmode": false, "fullscreen": false, diff --git a/scripts/cookbooks-json/wikipedia-semantic-search.json b/scripts/cookbooks-json/wikipedia-semantic-search.json index 325881e6..a02e01ea 100644 --- a/scripts/cookbooks-json/wikipedia-semantic-search.json +++ b/scripts/cookbooks-json/wikipedia-semantic-search.json @@ -13,7 +13,7 @@ }, "title": "Wikipedia Semantic Search with Cohere Embedding Archives", "slug": "wikipedia-semantic-search", - "body": "[block:html]\n{\n \"html\": \"\\n\\n
\\n

Wikipedia Semantic Search with Cohere Embedding Archives

\\n
\\n\\n\"\n}\n[/block]\n\nThis notebook contains the starter code to do simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://txt.cohere.ai/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). \n\nLet's now download 1,000 records from the English Wikipedia embeddings archive so we can search it afterwards.\n\n\n```python\nfrom datasets import load_dataset\nimport torch\nimport cohere\n\nco = cohere.Client(\"\") \n\n#Load at max 1000 documents + embeddings\nmax_docs = 1000\ndocs_stream = load_dataset(f\"Cohere/wikipedia-22-12-simple-embeddings\", split=\"train\", streaming=True)\n\ndocs = []\ndoc_embeddings = []\n\nfor doc in docs_stream:\n docs.append(doc)\n doc_embeddings.append(doc['emb'])\n if len(docs) >= max_docs:\n break\n\ndoc_embeddings = torch.tensor(doc_embeddings)\n```\n\n\n Downloading: 0%| | 0.00/1.29k [00:00\\n \\n
\\n \\n \\n \\n \\n \\n \\n
\\n Back to Cookbooks\\n
\\n\\n \\n Open in GitHub\\n
\\n \\n \\n \\n \\n \\n \\n
\\n
\\n\\n\\n
\\n

Wikipedia Semantic Search with Cohere Embedding Archives

\\n
\\n\\n\"\n}\n[/block]\n\nThis notebook contains the starter code to do simple [semantic search](https://cohere.com/llmu/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://cohere.com/blog/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). \n\nLet's now download 1,000 records from the English Wikipedia embeddings archive so we can search it afterwards.\n\n\n```python\nfrom datasets import load_dataset\nimport torch\nimport cohere\n\nco = cohere.Client(\"\") \n\n#Load at max 1000 documents + embeddings\nmax_docs = 1000\ndocs_stream = load_dataset(f\"Cohere/wikipedia-22-12-simple-embeddings\", split=\"train\", streaming=True)\n\ndocs = []\ndoc_embeddings = []\n\nfor doc in docs_stream:\n docs.append(doc)\n doc_embeddings.append(doc['emb'])\n if len(docs) >= max_docs:\n break\n\ndoc_embeddings = torch.tensor(doc_embeddings)\n```\n\n\n Downloading: 0%| | 0.00/1.29k [00:00 diff --git a/scripts/cookbooks-mdx/fueling-generative-content.mdx b/scripts/cookbooks-mdx/fueling-generative-content.mdx index 11d1d1d5..b2316064 100644 --- a/scripts/cookbooks-mdx/fueling-generative-content.mdx +++ b/scripts/cookbooks-mdx/fueling-generative-content.mdx @@ -135,7 +135,7 @@ slug: /page/fueling-generative-content Generative models have proven extremely useful in content idea generation. But they don’t take into account user search demand and trends. In this notebook, let’s see how we can solve that by adding keyword research into the equation. -Read the accompanying [blog post here](https://txt.cohere.ai/generative-content-keyword-research/). +Read the accompanying [blog post here](https://cohere.com/blog/generative-content-keyword-research/). ```python PYTHON ! pip install cohere -q diff --git a/scripts/cookbooks-mdx/hello-world-meet-ai.mdx b/scripts/cookbooks-mdx/hello-world-meet-ai.mdx index b7a4b089..de119e06 100644 --- a/scripts/cookbooks-mdx/hello-world-meet-ai.mdx +++ b/scripts/cookbooks-mdx/hello-world-meet-ai.mdx @@ -135,7 +135,7 @@ slug: /page/hello-world-meet-ai Here we take a quick tour of what’s possible with language AI via Cohere’s Large Language Model (LLM) API. This is the Hello, World! of language AI, written for developers with little or no background in AI. In fact, we’ll do that by exploring the Hello, World! phrase itself. -Read the accompanying [blog post here](https://txt.cohere.ai/hello-world-p1/). +Read the accompanying [blog post here](https://cohere.com/blog/hello-world-p1/). Hello World! Meet Language AI diff --git a/scripts/cookbooks-mdx/multilingual-search.mdx b/scripts/cookbooks-mdx/multilingual-search.mdx index b9d704e4..75de0baf 100644 --- a/scripts/cookbooks-mdx/multilingual-search.mdx +++ b/scripts/cookbooks-mdx/multilingual-search.mdx @@ -133,7 +133,7 @@ slug: /page/multilingual-search -***Read the accompanying [blog post here](https://txt.cohere.ai/search-cohere-langchain/).*** +***Read the accompanying [blog post here](https://cohere.com/blog/search-cohere-langchain/).*** This notebook contains two examples for performing multilingual search using Cohere and Langchain. Langchain is a library that assists the development of applications built on top of large language models (LLMs), such as Cohere's models. 
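To make that division of labor concrete, here is a minimal sketch (our illustration, not part of the original page) of calling Cohere's multilingual model through Langchain's `CohereEmbeddings` wrapper; the model name matches the one used later in this notebook:

```python PYTHON
from langchain.embeddings.cohere import CohereEmbeddings

# Assumes the COHERE_API_KEY environment variable is set
embeddings = CohereEmbeddings(model="multilingual-22-12")

# Embed a small batch of documents and a query with the same model
doc_vectors = embeddings.embed_documents(["Hello, world!", "Bonjour le monde !"])
query_vector = embeddings.embed_query("a friendly greeting")

# Prints the number of document vectors and the model's embedding dimension
print(len(doc_vectors), len(query_vector))
```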
diff --git a/scripts/cookbooks-mdx/wikipedia-semantic-search.mdx b/scripts/cookbooks-mdx/wikipedia-semantic-search.mdx index fc1c6a26..7b456d05 100644 --- a/scripts/cookbooks-mdx/wikipedia-semantic-search.mdx +++ b/scripts/cookbooks-mdx/wikipedia-semantic-search.mdx @@ -132,7 +132,7 @@ slug: /page/wikipedia-semantic-search } -This notebook contains the starter code to do simple [semantic search](https://txt.cohere.ai/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://txt.cohere.ai/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). +This notebook contains the starter code to do simple [semantic search](https://cohere.com/llmu/what-is-semantic-search/) on the [Wikipedia embeddings archives](https://cohere.com/blog/embedding-archives-wikipedia/) published by Cohere. These archives embed Wikipedia sites in multiple languages. In this example, we'll use [Wikipedia Simple English](https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings). Let's now download 1,000 records from the English Wikipedia embeddings archive so we can search it afterwards. @@ -167,7 +167,7 @@ doc_embeddings = torch.tensor(doc_embeddings) Using custom data configuration Cohere--wikipedia-22-12-simple-embeddings-94deea3d55a22093 -Now, `doc_embeddings` holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an [embeddings vector](https://txt.cohere.ai/sentence-word-embeddings/) of 768 values. +Now, `doc_embeddings` holds the embeddings of the first 1,000 documents in the dataset. Each document is represented as an [embeddings vector](https://cohere.com/llmu/sentence-word-embeddings/) of 768 values. ```python PYTHON
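# What follows is an illustrative sketch of our own (this excerpt cuts off before
# the notebook's actual query code): embed a query and rank documents by dot
# product. We assume the archive was produced with the multilingual-22-12 model,
# and that `co`, `docs`, and `doc_embeddings` are defined earlier in the notebook.
query = "Who invented the printing press?"  # hypothetical example query

response = co.embed(texts=[query], model="multilingual-22-12")
query_embedding = torch.tensor(response.embeddings[0])

# Dot-product scores against the 1,000 document embeddings, then take the top 3
scores = doc_embeddings @ query_embedding
top_k = torch.topk(scores, k=3)

for score, idx in zip(top_k.values, top_k.indices):
    doc = docs[idx]
    print(f"{score:.2f} | {doc['title']}: {doc['text'][:100]}")
```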