Improvements: Sharepoint OAuth, docs improvements, code clarifications #17

Merged
merged 7 commits into from
Dec 19, 2023
1 change: 1 addition & 0 deletions sharepoint/.env-template
@@ -1,3 +1,4 @@
SHAREPOINT_AUTH_TYPE=
SHAREPOINT_CLIENT_ID=
SHAREPOINT_CLIENT_SECRET=
SHAREPOINT_TENANT_ID=
98 changes: 67 additions & 31 deletions sharepoint/README.md
@@ -6,72 +6,108 @@ It uses Microsoft Graph API to run the search query and return matching files.

# Limitations

The Sharepoint connector currently allows for full-text search based on file contents stored within your Sharepoint instance. Note, however, that only List items and Drive items are currently returned by the search API.
The Sharepoint connector allows for full-text search over all files in your Sharepoint instance. It supports two types of authentication:

- Application auth: Allows searching all files that the app has access to.
- Delegated auth (OAuth): Allows searching files that the authenticated user has access to (recommended).

Important: Sharepoint's default interval for content crawling is set to every 15 minutes. Expect a delay between uploading new files and being able to search for them.

## Configuration

1. Register a new Microsoft App

Running this connector requires access to Microsoft 365. For development purposes,
you can register for the Microsoft 365 developer program, which will grant temporary
access to Microsoft 365.

For the connector to work, you must register the application. To do this, go to the
For the connector to work, you must create a new application. To do this, go to the
Microsoft Entra admin center:

https://entra.microsoft.com/

Navigate to Applications > App registrations > New registration option.

Select "Web" as the platform, and ensure you add a redirect URL, even if it is optional.
The redirect URL is required for the admin consent step to work. This connector does not
have a redirect page implemented, but you can use http://localhost/ as the redirect URL.
Select "Web" as the platform, and add a redirect URI as needed. For App auth, you can set the URI to the server you're hosting the connector on. For Delegated auth, set the URI to `https://api.cohere.com/v1/connectors/oauth/token`.

On the app registration page for the app you have created, go to API permissions, and
grant permissions. For development purposes, you can grant:
Next, we will configure your App permissions (this requires Admin access on Entra). Go to your app's API permissions page and select Add a permission > Microsoft Graph. From here, select either Application or Delegated permissions as required, and check the following permissions:

- SharePointTenantSettings.Read.All
- SharePointTenantSettings.ReadWrite.All
- Sites.FullControl.All
- Sites.Manage.All
- Sites.Read.All
- Sites.ReadWrite.All
- Sites.Selected
- `offline_access` (only if using Delegated)
- `Application.Read.All`
- `Files.ReadWrite.All` (MSFT requires this to enable search, though this connector will never write anything)

You will then have to create a client secret for the application. Then take the app's credentials (`SHAREPOINT_GRAPH_TENANT_ID`, `SHAREPOINT_GRAPH_CLIENT_ID` and `SHAREPOINT_GRAPH_CLIENT_SECRET`) and copy them into a `.env` file using the `.env-template` as the base template.
Go back to API permissions, and as an Admin, select Grant admin consent for MSFT.

To process the files in a readable format by Coral, the Sharepoint connector leverages
In order to process OneDrive files, it is necessary to provide credentials for Unstructured:
Then, head to Certificates & Secrets and create a new client secret.

- `SHAREPOINT_UNSTRUCTURED_BASE_URL`
- `SHAREPOINT_UNSTRUCTURED_API_KEY`
The above environment variables can be read from a .env file. See `.env-template` for an example `.env` file.

To use the hosted Unstructured API, you must provide an API key and set `SHAREPOINT_GRAPH_UNSTRUCTURED_BASE_URL`
too. A trailing slash should not be included (i.e. `http://localhost:8000` or `https://api.unstructured.io`).
2. Authentication

You can configure which file types will be processed by Unstructured with the `SHAREPOINT_PASSTHROUGH_FILE_TYPES` environment variable. This should be a comma-separated list of strings. Any files matching the types defined will skip Unstructured.
We will now cover the two types of authentication supported by this connector. To use either type of authentication, specify the `SHAREPOINT_AUTH_TYPE` environment variable as either `application` for App auth, or `user` for Delegated auth.

The above environment variables can be read from a .env file. See `.env-template` for an example `.env` file.
### Application authentication

For application authentication, you will need to set up the following environment variables in a `.env` file:

```bash
SHAREPOINT_AUTH_TYPE=application
SHAREPOINT_CLIENT_ID=<obtainable from app details>
SHAREPOINT_CLIENT_SECRET=<obtainable from app credentials>
SHAREPOINT_TENANT_ID=<obtainable from app details>
```

### Delegated authentication

For delegated authentication, you will need to add the following environment variable in a `.env` file:

After the client has been created, you will need to grant admin consent to the client. One
way to do this is by going to the following URL:
```bash
SHAREPOINT_AUTH_TYPE=user
```

Other than that, no configuration is needed. When registering the connector you will specify all the details required for Cohere to handle the authentication steps (details to follow).
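
The two auth modes can be checked before the server starts. The following is a minimal sketch, not the connector's own code: `check_auth_settings` is a hypothetical helper, and it assumes the variable names and the `application` / `user` values given in this README.

```python
import os
from typing import List, Mapping, Optional

# Values documented in this README for SHAREPOINT_AUTH_TYPE.
APPLICATION_AUTH = "application"
DELEGATED_AUTH = "user"

# Variables this README lists for application auth.
APPLICATION_VARS = (
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT_CLIENT_SECRET",
    "SHAREPOINT_TENANT_ID",
)


def check_auth_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of missing variables for the selected auth type.

    Hypothetical helper (not part of the connector). Raises ValueError when
    SHAREPOINT_AUTH_TYPE is unset or is not one of the two documented values.
    """
    env = os.environ if env is None else env
    auth_type = env.get("SHAREPOINT_AUTH_TYPE", "")
    if auth_type not in (APPLICATION_AUTH, DELEGATED_AUTH):
        raise ValueError(
            "SHAREPOINT_AUTH_TYPE must be 'application' or 'user', "
            f"got {auth_type!r}"
        )
    if auth_type == DELEGATED_AUTH:
        # Delegated auth needs no further variables on the connector side.
        return []
    return [name for name in APPLICATION_VARS if not env.get(name)]
```

Calling `check_auth_settings()` with no argument reads from `os.environ`; passing a plain dict makes the check easy to exercise in tests.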

To configure delegated user OAuth, make sure the app you registered in Step 1 has a Redirect URI to `https://api.cohere.com/v1/connectors/oauth/token`.

Next, register the connector with Cohere's API using the following configuration.

https://login.microsoftonline.com/{site_id}/adminconsent?client_id={client_id}&redirect_uri=http://localhost/
```bash
curl -X POST \
'https://api.cohere.ai/v1/connectors' \
--header 'Accept: */*' \
--header 'Authorization: Bearer {COHERE-API-KEY}' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Sharepoint with OAuth",
"url": "{YOUR_CONNECTOR-URL}",
"oauth": {
"client_id": "{Your Microsoft App CLIENT-ID}",
"client_secret": "{Your Microsoft App CLIENT-SECRET}",
"authorize_url": "https://login.microsoftonline.com/{Your Microsoft App TENANT-ID}/oauth2/v2.0/authorize",
"token_url": "https://login.microsoftonline.com/{Your Microsoft App TENANT-ID}/oauth2/v2.0/token",
"scope": ".default offline_access"
}
}'
```

You must replace `{site_id}` and `{client_id}` with the appropriate values. The `redirect_uri`
must match the value that was configured when creating the client in Microsoft Entra.
Once properly registered, whenever a search request is made Cohere will take care of authorizing the current user and passing the correct access tokens in the request headers.
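
The registration body can also be assembled programmatically, which avoids typos in the two tenant-specific URLs. This is a sketch under stated assumptions: `build_connector_payload` is a hypothetical helper, and the field names, URLs, and scope string are copied from the registration request shown in this README rather than from any separate API reference.

```python
import json
from typing import Any, Dict

# Authority host as it appears in the authorize and token URLs in this README.
AUTHORITY = "https://login.microsoftonline.com"


def build_connector_payload(
    name: str,
    connector_url: str,
    client_id: str,
    client_secret: str,
    tenant_id: str,
) -> Dict[str, Any]:
    """Build the JSON body for registering the connector (hypothetical helper)."""
    oauth_base = f"{AUTHORITY}/{tenant_id}/oauth2/v2.0"
    return {
        "name": name,
        "url": connector_url,
        "oauth": {
            "client_id": client_id,
            "client_secret": client_secret,
            "authorize_url": f"{oauth_base}/authorize",
            "token_url": f"{oauth_base}/token",
            # Scope string as given in this README.
            "scope": ".default offline_access",
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    payload = build_connector_payload(
        name="Sharepoint with OAuth",
        connector_url="https://connector.example.com/search",
        client_id="my-client-id",
        client_secret="my-client-secret",
        tenant_id="my-tenant-id",
    )
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the request body with any HTTP client.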

### Provision Unstructured

Processing the files found on OneDrive requires the Unstructured API. The Unstructured API is
a commercially backed, Open Source project. It is available as a hosted API, Docker image, and as a
Python package, which can be manually set up.

By default, this connector uses the hosted `https://api.unstructured.io` API. You must provide an API key by registering an account and obtaining an API key [here](https://unstructured.io/api-key).
To configure Unstructured, set up these two environment variables:

```bash
SHAREPOINT_UNSTRUCTURED_BASE_URL=https://api.unstructured.io
SHAREPOINT_UNSTRUCTURED_API_KEY=(optional)
```

Alternatively, you can use the API by hosting it yourself with their provided Docker image. If you've used Docker before, the setup is relatively straightforward. Please follow the instructions for setting up the Docker image in the Unstructured [documentation](https://unstructured-io.github.io/unstructured/api.html#using-docker-images).
By default, this connector uses the hosted `https://api.unstructured.io` API that requires an API key obtainable by registering an account [here](https://unstructured.io/api-key).

The final option is to set Unstructured up locally, outside of Docker. This is a complex option that is not recommended, as it involves installing many dependencies outside of Python.
Alternatively, you can use the API by hosting it yourself with their provided Docker image. If you've used Docker before, the setup is relatively straightforward. Please follow the instructions for setting up the Docker image in the Unstructured [documentation](https://unstructured-io.github.io/unstructured/api.html#using-docker-images). With this self-hosted option, no API key is required.
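
The rules described here (hosted API by default, API key required only for the hosted API, no trailing slash on the base URL) can be sketched as a small settings loader. `load_unstructured_settings` is a hypothetical helper and not the connector's own code; it assumes only the two variable names shown in this README.

```python
import os
from typing import Mapping, Optional, Tuple

# Hosted endpoint named in this README as the default.
HOSTED_UNSTRUCTURED_URL = "https://api.unstructured.io"


def load_unstructured_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (base_url, api_key) for Unstructured (hypothetical helper).

    The base URL falls back to the hosted API and has any trailing slash
    removed. An API key is required only when the hosted API is used.
    """
    env = os.environ if env is None else env
    base_url = env.get("SHAREPOINT_UNSTRUCTURED_BASE_URL") or HOSTED_UNSTRUCTURED_URL
    base_url = base_url.rstrip("/")
    api_key = env.get("SHAREPOINT_UNSTRUCTURED_API_KEY") or None
    if base_url == HOSTED_UNSTRUCTURED_URL and api_key is None:
        raise ValueError(
            "SHAREPOINT_UNSTRUCTURED_API_KEY is required for the hosted API"
        )
    return base_url, api_key
```

With a self-hosted Docker instance, setting only `SHAREPOINT_UNSTRUCTURED_BASE_URL=http://localhost:8000` is enough under this sketch.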

### Run Flask Server

187 changes: 87 additions & 100 deletions sharepoint/provider/client.py
@@ -1,141 +1,128 @@
from functools import lru_cache
from azure.identity import ClientSecretCredential
from flask import current_app as app
from msgraph.core import GraphClient, APIVersion
from urllib.parse import urlparse
import requests

from msal import ConfidentialClientApplication
from flask import current_app as app, request

from . import UpstreamProviderError
from .consts import CACHE_SIZE

client = None
AUTHORIZATION_HEADER = "Authorization"
BEARER_PREFIX = "Bearer "


class SharepointClient:
DEFAULT_SCOPES = ["https://graph.microsoft.com/.default"]
DEFAULT_REGION = "NAM"
SEARCH_ENTITY_TYPES = ["driveItem", "listItem"]
SEARCH_URL = "/search/query"
SEARCH_LIMIT = 3
BASE_URL = "https://graph.microsoft.com/v1.0"
SEARCH_ENTITY_TYPES = ["driveItem"]
DRIVE_ITEM_DATA_TYPE = "#microsoft.graph.driveItem"
APPLICATION_AUTH = "application"
DELEGATED_AUTH = "user"

def __init__(self, auth_type, search_limit):
self.access_token = None
self.user = None
self.auth_type = auth_type
self.search_limit = search_limit

graph_client = None
def get_auth_type(self):
return self.auth_type

def __init__(self, tenant_id, client_id, client_secret, search_limit=5):
def set_app_access_token(self, tenant_id, client_id, client_secret):
try:
credential = ClientSecretCredential(
tenant_id,
client_id,
client_secret,
credential = ConfidentialClientApplication(
client_id=client_id,
client_credential=client_secret,
authority=f"https://login.microsoftonline.com/{tenant_id}",
)

self.graph_client = GraphClient(
credential=credential,
token_response = credential.acquire_token_for_client(
scopes=self.DEFAULT_SCOPES,
api_version=APIVersion.beta,
)
if "access_token" not in token_response:
raise UpstreamProviderError(
"Error while retrieving access token from Microsoft Graph API"
)
self.access_token = token_response["access_token"]
self.headers = {"Authorization": f"Bearer {self.access_token}"}
except Exception as e:
raise UpstreamProviderError(
f"Error while initializing Sharepoint client: {str(e)}"
)

self.search_limit = search_limit
def set_user_access_token(self, token):
self.access_token = token
self.headers = {"Authorization": f"Bearer {self.access_token}"}

@lru_cache(CACHE_SIZE)
def search(self, query):
search_response = self.graph_client.post(
self.SEARCH_URL,
json={
"requests": [
{
"entityTypes": self.SEARCH_ENTITY_TYPES,
"query": {
"queryString": query,
"size": self.SEARCH_LIMIT,
},
"region": self.DEFAULT_REGION,
}
]
request = {
"entityTypes": self.SEARCH_ENTITY_TYPES,
"query": {
"queryString": query,
"size": self.search_limit,
},
)
}

if not search_response.ok:
message = (
search_response.json()
.get("error", {})
.get("message", "Error calling Microsoft Graph API")
)
raise UpstreamProviderError(message)

return search_response.json()["value"][0]["hitsContainers"]
if self.auth_type == self.APPLICATION_AUTH:
request["region"] = self.DEFAULT_REGION

@lru_cache(CACHE_SIZE)
def get_pages(self, site_id):
page_url = f"/sites/{site_id}/pages"
response = self.graph_client.get(page_url)
response = requests.post(
f"{self.BASE_URL}/search/query",
headers=self.headers,
json={"requests": [request]},
)

if not response.ok:
return []

return response.json()

@lru_cache(CACHE_SIZE)
def fetch_page(self, url):
parsed_url = urlparse(url)
site_id = parsed_url.netloc
pages = self.get_pages(site_id)

# Find page by path
matching_page = None
for page in pages["value"]:
normalized_page_path = f"/{page['webUrl']}"
if normalized_page_path == parsed_url.path:
matching_page = page
break

return matching_page

@lru_cache(CACHE_SIZE)
def get_drive_item(self, parent_drive_id, resource_id):
drive_item_url = f"/drives/{parent_drive_id}/items/{resource_id}/content"

get_response = self.graph_client.get(drive_item_url)

# Fail gracefully when retrieving content
if not get_response.ok:
return {}
raise UpstreamProviderError(
f"Error while searching Sharepoint: {response.text}"
)

return get_response.content
return response.json()["value"][0]["hitsContainers"]

@lru_cache(CACHE_SIZE)
def get_list_item(self, site_id, page_id):
list_item_url = (
f"/sites/{site_id}/pages/{page_id}/microsoft.graph.sitePage/webParts"
def get_drive_item_content(self, parent_drive_id, resource_id):
response = requests.get(
f"{self.BASE_URL}/drives/{parent_drive_id}/items/{resource_id}/content",
headers=self.headers,
)
get_response = self.graph_client.get(list_item_url)

# Fail gracefully when retrieving content
if not get_response.ok:
if not response.ok:
return {}

return get_response.json()
return response.content


def get_client():
global client
if client is not None:
return client

# Fetch environment variables
assert (
tenant_id := app.config.get("TENANT_ID")
), "SHAREPOINT_TENANT_ID must be set"
assert (
client_id := app.config.get("CLIENT_ID")
), "SHAREPOINT_CLIENT_ID must be set"
assert (
client_secret := app.config.get("CLIENT_SECRET")
), "SHAREPOINT_CLIENT_SECRET must be set"
search_limit = app.config.get("SEARCH_LIMIT", 5)
auth_type := app.config.get("AUTH_TYPE")
), "SHAREPOINT_AUTH_TYPE must be set"

client = SharepointClient(tenant_id, client_id, client_secret, search_limit)
search_limit = app.config.get("SEARCH_LIMIT", 5)
client = SharepointClient(auth_type, search_limit)

if auth_type == client.APPLICATION_AUTH:
assert (
tenant_id := app.config.get("TENANT_ID")
), "SHAREPOINT_TENANT_ID must be set"
assert (
client_id := app.config.get("CLIENT_ID")
), "SHAREPOINT_CLIENT_ID must be set"
assert (
client_secret := app.config.get("CLIENT_SECRET")
), "SHAREPOINT_CLIENT_SECRET must be set"
client.set_app_access_token(tenant_id, client_id, client_secret)
elif auth_type == client.DELEGATED_AUTH:
token = get_access_token()
if token is None:
raise UpstreamProviderError("No access token provided in request")
client.set_user_access_token(token)
else:
raise UpstreamProviderError(f"Invalid auth type: {auth_type}")

return client


def get_access_token():
authorization_header = request.headers.get(AUTHORIZATION_HEADER, "")
if authorization_header.startswith(BEARER_PREFIX):
return authorization_header.removeprefix(BEARER_PREFIX)
return None
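
Review note: `get_access_token` above relies on `str.removeprefix`, which exists only on Python 3.9 and later. Below is a framework-free sketch of the same bearer-token parsing that also runs on older interpreters. `extract_bearer_token` is a hypothetical name, and it takes any mapping of header names to values instead of Flask's `request.headers`, so the lookup here is case-sensitive.

```python
from typing import Mapping, Optional

AUTHORIZATION_HEADER = "Authorization"
BEARER_PREFIX = "Bearer "


def extract_bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header.

    Returns None when the header is missing, uses another scheme, or
    carries an empty token.
    """
    value = headers.get(AUTHORIZATION_HEADER, "")
    if not value.startswith(BEARER_PREFIX):
        return None
    # Slicing is equivalent to removeprefix once startswith has matched.
    token = value[len(BEARER_PREFIX):]
    return token or None
```

Unlike the function in the diff, this sketch treats an empty token as missing, which is a design choice of the sketch and not behavior confirmed by the PR.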
1 change: 0 additions & 1 deletion sharepoint/provider/consts.py

This file was deleted.

6 changes: 0 additions & 6 deletions sharepoint/provider/enums.py

This file was deleted.
