Improvements: Sharepoint OAuth, docs improvements, code clarifications #17

Merged
merged 7 commits into from
Dec 19, 2023
1 change: 1 addition & 0 deletions sharepoint/.env-template
@@ -1,3 +1,4 @@
SHAREPOINT_AUTH_TYPE=
SHAREPOINT_CLIENT_ID=
SHAREPOINT_CLIENT_SECRET=
SHAREPOINT_TENANT_ID=
92 changes: 61 additions & 31 deletions sharepoint/README.md
@@ -6,72 +6,102 @@ It uses Microsoft Graph API run the search query and return matching files.

# Limitations

The Sharepoint connector currently allows for full-text search based on file contents stored within your Sharepoint instance, it is important to note however that only List items and Drive items are currently returned by the search API.
The Sharepoint connector allows for full-text search over all files in your Sharepoint instance. It supports two types of authentication:

- Application auth: Allows searching all files that the app has access to.
- Delegated auth (OAuth): Allows searching files that the authenticated user has access to (recommended).

Important: Sharepoint's default interval for content crawling is set to every 15 minutes. Expect a delay between uploading new files and being able to search for them.

## Configuration

1. Register a new Microsoft App

Running this connector requires access to Microsoft 365. For development purposes,
you can register for the Microsoft 365 developer program, which grants temporary
access to a Microsoft 365 subscription.

For the connector to work, you must register the application. To do this, go to the
For the connector to work, you must create a new application. To do this, go to the
Microsoft Entra admin center:

https://entra.microsoft.com/

Navigate to Applications > App registrations > New registration option.

Select "Web" as the platform, and ensure you add a redirect URL, even if it is optional.
The redirect URL is required for the admin consent step to work. This connector does not
have a redirect page implemented, but you can use http://localhost/ as the redirect URL.
Select "Web" as the platform, and add a redirect URI as needed. For App auth, you can set the URI to the server you're hosting the connector on. For Delegated auth, set the URI to `https://api.cohere.com/v1/connectors/oauth/token`.

On the app registration page for the app you have created, go to API permissions, and
grant permissions. For development purposes, you can grant:
Next, configure your app's permissions (this requires admin access on Entra). Go to your app's API permissions page and select Add a permission > Microsoft Graph > Application permissions. In the Select permissions dialog, choose `Application.Read.All`.

- SharePointTenantSettings.Read.All
- SharePointTenantSettings.ReadWrite.All
- Sites.FullControl.All
- Sites.Manage.All
- Sites.Read.All
- Sites.ReadWrite.All
- Sites.Selected
Then, head to Certificates & Secrets and create a new client secret.

You will then have to create a client secret for the application. Then take the app's credentials (:code:`SHAREPOINT_GRAPH_TENANT_ID`, :code:`SHAREPOINT_GRAPH_CLIENT_ID` and :code:`SHAREPOINT_GRAPH_CLIENT_SECRET`) and copy them into a `.env` file using the `.env-template` as the base template.
The above environment variables can be read from a .env file. See `.env-template` for an example `.env` file.

To process the files in a readable format by Coral, the Sharepoint connector leverages
In order to process OneDrive files, it is necessary to provide credentials for Unstructured:
2. Authentication

- `SHAREPOINT_UNSTRUCTURED_BASE_URL`
- `SHAREPOINT_UNSTRUCTURED_API_KEY`
We will now cover the two types of authentication supported by this connector. To use either type of authentication, specify the `SHAREPOINT_AUTH_TYPE` environment variable as either `application` for App auth, or `user` for Delegated auth.

To use the hosted Unstructured API, you must provide an API key and set `SHAREPOINT_GRAPH_UNSTRUCTURED_BASE_URL`
too. A trailing slash should not be included (i.e. `http://localhost:8000` or `https://api.unstructured.io`).
### Application authentication

You can configure which file types will be processed by Unstructured with the `SHAREPOINT_PASSTHROUGH_FILE_TYPES` environment variable. This should be a comma-separated list of strings. Any files matching the types defined will skip Unstructured.
For application authentication, you will need to set up the following environment variables in a `.env` file:

The above environment variables can be read from a .env file. See `.env-template` for an example `.env` file.
```bash
SHAREPOINT_AUTH_TYPE=application
SHAREPOINT_CLIENT_ID=<obtainable from app details>
SHAREPOINT_CLIENT_SECRET=<obtainable from app credentials>
SHAREPOINT_TENANT_ID=<obtainable from app details>
```
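With application auth, these three values are what an OAuth 2.0 client-credentials token request needs. The sketch below only builds that request so it can be inspected without network access. The helper name `build_token_request` is hypothetical, and the `https://graph.microsoft.com/.default` scope is an assumption about how Graph permissions would be requested, not something taken from this connector's code.

```python
from urllib.parse import urlencode

# Same token endpoint pattern as the "token_url" used for delegated auth.
TOKEN_URL_TEMPLATE = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for a client-credentials token request.

    Hypothetical helper: it prepares the request but does not send it.
    """
    url = TOKEN_URL_TEMPLATE.format(tenant_id=tenant_id)
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # Assumed scope: ask for the permissions already granted to the app.
            "scope": "https://graph.microsoft.com/.default",
        }
    )
    return url, body
```

The body would be sent as `application/x-www-form-urlencoded` in a POST; the JSON response is expected to carry an `access_token` field.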

### Delegated authentication

For delegated authentication, you will need to add the following environment variable in a `.env` file:

```bash
SHAREPOINT_AUTH_TYPE=user
```

Other than that, no configuration is needed. When registering the connector you will specify all the details required for Cohere to handle the authentication steps (details to follow).
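A startup check along the following lines could catch a misconfigured `.env` early. It is a minimal sketch, assuming the variable names listed in this README; the function name `check_auth_config` is hypothetical and not part of the connector.

```python
import os

VALID_AUTH_TYPES = ("application", "user")
# Only application auth needs the app credentials in the environment.
APP_AUTH_VARS = (
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT_CLIENT_SECRET",
    "SHAREPOINT_TENANT_ID",
)


def check_auth_config(env=None):
    """Return the auth type, raising ValueError if the settings are incomplete."""
    env = os.environ if env is None else env
    auth_type = env.get("SHAREPOINT_AUTH_TYPE", "")
    if auth_type not in VALID_AUTH_TYPES:
        raise ValueError(
            f"SHAREPOINT_AUTH_TYPE must be one of {VALID_AUTH_TYPES}, got {auth_type!r}"
        )
    if auth_type == "application":
        missing = [name for name in APP_AUTH_VARS if not env.get(name)]
        if missing:
            raise ValueError(
                "Missing settings for application auth: " + ", ".join(missing)
            )
    return auth_type
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise in tests.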

After the client has been created, you will need to grant admin consent to the client. One
way to do this is by going to the following URL:
To configure delegated user OAuth, make sure the app you registered in Step 1 has its Redirect URI set to `https://api.cohere.com/v1/connectors/oauth/token`.

https://login.microsoftonline.com/{site_id}/adminconsent?client_id={client_id}&redirect_uri=http://localhost/
Next, register the connector with Cohere's API using the following configuration.

You must replace `{site_id}` and `{client_id}` with the appropriate values. The `redirect_uri`
must match the value that was configured when creating the client in Microsoft Entra.
```bash
curl -X POST \
'https://api.cohere.ai/v1/connectors' \
--header 'Accept: */*' \
--header 'Authorization: Bearer {COHERE-API-KEY}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Sharepoint with OAuth",
"url": "{YOUR_CONNECTOR-URL}",
"oauth": {
"client_id": "{Your Microsoft App CLIENT-ID}",
"client_secret": "{Your Microsoft App CLIENT-SECRET}",
"authorize_url": "https://login.microsoftonline.com/{Your Microsoft App TENANT-ID}/oauth2/v2.0/authorize",
"token_url": "https://login.microsoftonline.com/{Your Microsoft App TENANT-ID}/oauth2/v2.0/token",
"scope": ".default offline_access"
}
}'
```

Once properly registered, whenever a search request is made Cohere will take care of authorizing the current user and passing the correct access tokens in the request headers.
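On the connector side, the forwarded token then has to be read from the incoming request. The sketch below assumes it arrives in a standard `Authorization: Bearer <token>` header; the README only says "in the request headers", so the header name and the helper `extract_bearer_token` are assumptions.

```python
def extract_bearer_token(headers):
    """Return the token from an 'Authorization: Bearer <token>' header, or None.

    `headers` is any mapping with a .get() method. A plain dict is
    case-sensitive, so the key must be spelled exactly "Authorization" here.
    """
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token
```

A `None` result would be a natural point to reject the search request with a 401 response rather than call the Graph API without credentials.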

### Provision Unstructured

Processing the files found on OneDrive requires the Unstructured API. The Unstructured API is
a commercially backed, Open Source project. It is available as a hosted API, Docker image, and as a
Python package, which can be manually set up.

By default, this connector uses the hosted `https://api.unstructured.io` API. You must provide an API key by registering an account and obtaining an API key [here](https://unstructured.io/api-key).
To configure Unstructured, set up these two environment variables:

```bash
SHAREPOINT_UNSTRUCTURED_BASE_URL=https://api.unstructured.io
SHAREPOINT_UNSTRUCTURED_API_KEY=(optional)
```

Alternatively, you can use the API by hosting it yourself with their provided Docker image. If you've used Docker before, the setup is relatively straightforward. Please follow the instructions for setting up the Docker image in the Unstructured [documentation](https://unstructured-io.github.io/unstructured/api.html#using-docker-images).
By default, this connector uses the hosted `https://api.unstructured.io` API that requires an API key obtainable by registering an account [here](https://unstructured.io/api-key).

The final option is to set Unstructured up locally, outside of Docker. This is a complex option that is not recommended, as it involves installing many dependencies outside of Python.
Alternatively, you can use the API by hosting it yourself with their provided Docker image. If you've used Docker before, the setup is relatively straightforward. Please follow the instructions for setting up the Docker image in the Unstructured [documentation](https://unstructured-io.github.io/unstructured/api.html#using-docker-images). With this self-hosted option, no API key is required.
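As an illustration of how the two variables fit together, the sketch below turns them into a request URL and headers. The `/general/v0/general` path and the `unstructured-api-key` header name are assumptions based on Unstructured's public API and may not match what this connector actually sends; `unstructured_settings` is a hypothetical helper.

```python
import os

DEFAULT_BASE_URL = "https://api.unstructured.io"


def unstructured_settings(env=None):
    """Return (endpoint_url, headers) derived from the environment variables."""
    env = os.environ if env is None else env
    # Fall back to the hosted API and tolerate a stray trailing slash.
    base_url = (env.get("SHAREPOINT_UNSTRUCTURED_BASE_URL") or DEFAULT_BASE_URL).rstrip("/")
    api_key = env.get("SHAREPOINT_UNSTRUCTURED_API_KEY", "")
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumed header name; a self-hosted instance can run without a key.
        headers["unstructured-api-key"] = api_key
    # Assumed endpoint path for the general partitioning API.
    return f"{base_url}/general/v0/general", headers
```

With a self-hosted Docker instance, only `SHAREPOINT_UNSTRUCTURED_BASE_URL` would need to change, for example to `http://localhost:8000`.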

### Run Flask Server

19 changes: 16 additions & 3 deletions sharepoint/provider/unstructured.py
@@ -1,12 +1,15 @@
 import asyncio
 import aiohttp
 import logging
+import sys
+import functools
 from collections import OrderedDict
 from flask import current_app as app

 logger = logging.getLogger(__name__)

-CACHE_SIZE = 256
+CACHE_LIMIT_BYTES = 20 * 1024 * 1024 # 20 MB to bytes
+TIMEOUT_SECONDS = 20

 unstructured = None

@@ -21,12 +24,22 @@ def __init__(self, unstructured_base_url, api_key):

     def start_session(self):
         self.loop = asyncio.new_event_loop()
-        self.session = aiohttp.ClientSession(loop=self.loop)
+        # Create ClientTimeout object to apply timeout for every request in the session
+        client_timeout = aiohttp.ClientTimeout(total=TIMEOUT_SECONDS)
+        self.session = aiohttp.ClientSession(loop=self.loop, timeout=client_timeout)

     def close_loop(self):
         self.loop.stop()
         self.loop.close()

+    def cache_size(self):
+        # Calculate the total size of values in bytes
+        total_size_bytes = functools.reduce(
+            lambda a, b: a + b, map(lambda v: sys.getsizeof(v), self.cache.values()), 0
+        )
+
+        return total_size_bytes
+
     def cache_get(self, key):
         self.cache.move_to_end(key)

@@ -35,7 +48,7 @@ def cache_get(self, key):
     def cache_put(self, key, item):
         self.cache[key] = item

-        if len(self.cache) > CACHE_SIZE:
+        while self.cache_size() > CACHE_LIMIT_BYTES:
             self.cache.popitem()

     async def close_session(self):
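The byte-limited cache in this diff can be read as the following standalone sketch. It is not the PR's code: the class and method names are hypothetical, and it differs in one respect worth noting. `OrderedDict.popitem()` removes the most recently added entry by default, so the sketch passes `last=False` to evict the least recently used entry instead.

```python
import sys
from collections import OrderedDict


class ByteBoundedCache:
    """LRU cache that evicts entries once their combined size passes a byte limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.cache = OrderedDict()

    def size_bytes(self):
        # sys.getsizeof is shallow: it measures each value object itself,
        # not the objects that value refers to.
        return sum(sys.getsizeof(value) for value in self.cache.values())

    def get(self, key):
        # Mark the entry as most recently used, then return it.
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, item):
        self.cache[key] = item
        self.cache.move_to_end(key)
        # last=False drops the oldest entry; the default would drop the newest.
        while self.cache and self.size_bytes() > self.limit_bytes:
            self.cache.popitem(last=False)
```

Because the size check is shallow, values such as `bytes` or `str` are measured accurately, while nested containers would be undercounted.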