Gather information about the requests to and from RC API #1157

Closed · 5 tasks done
adinuca opened this issue Nov 19, 2018 · 10 comments

adinuca commented Nov 19, 2018

Why

To reduce the number of calls to Elasticsearch, we need to understand who triggers them.

What

  • Data about requests made to RC is gathered from the access logs
  • The top 5 requests are determined for a time interval
  • Data about requests made to RC-API is gathered from the access logs
  • The top 5 requests are determined for the same time interval
  • Data about the number of requests made to Elasticsearch is retrieved from Elastic Cloud for the same time interval

Notes

adinuca self-assigned this Nov 19, 2018
charlesyoung added this to the Unscheduled milestone Dec 7, 2018
adinuca commented Dec 13, 2018

According to the Lumen documentation, there are no built-in statistics about cache usage, but we could generate them ourselves using the EventServiceProvider.
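
For illustration, here is a minimal Python sketch of the kind of hit/miss counters such event listeners could feed (Lumen itself is PHP and fires cache events such as CacheHit and CacheMissed; this sketch only models the idea, not the actual API):

from collections import Counter

class CountingCache:
    # Wraps a dict-like cache backend and counts hits and misses,
    # mimicking the statistics that listeners on the cache events would record.
    def __init__(self, backend=None):
        self.backend = {} if backend is None else backend
        self.stats = Counter()

    def get(self, key):
        if key in self.backend:
            self.stats['hits'] += 1
            return self.backend[key]
        self.stats['misses'] += 1
        return None

    def put(self, key, value):
        self.backend[key] = value

cache = CountingCache()
cache.get('contracts:page:1')           # miss
cache.put('contracts:page:1', ['...'])
cache.get('contracts:page:1')           # hit
print(cache.stats)                      # Counter({'misses': 1, 'hits': 1})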

adinuca commented Dec 13, 2018

More monitoring data from Elastic Cloud can be found here: https://3b83e4a3efda4f5e8c2f8f2ec07c0fc2.us-east-1.aws.found.io:9243/app/monitoring#/elasticsearch/nodes

adinuca commented Dec 13, 2018

Today, from 1am to 2am (GMT), there was a spike of requests to Elasticsearch (the table shows hours in the GMT+2 timezone):
[screenshot 2018-12-13 at 14 27 50: hourly Elasticsearch request counts]

Requests from AWS for that time frame have been centralised here: https://docs.google.com/spreadsheets/d/1mhHmy6n9m3PMY-BLzLhDXz3jFL6FECRJ8sW5oE09S18/edit#gid=153453992

adinuca commented Dec 14, 2018

Yesterday from 3pm to 4pm there was another spike of requests.
Requests have been centralised here: https://docs.google.com/spreadsheets/d/1mhHmy6n9m3PMY-BLzLhDXz3jFL6FECRJ8sW5oE09S18/edit#gid=1773113440&fvid=1689222157

Elasticsearch request rate for one node:
[screenshot 2018-12-14 at 15 40 37: request-rate graph]

adinuca commented Dec 14, 2018

To obtain the above data, I downloaded the AWS ALB logs from S3 (aws s3 cp s3://nrgi-lb-logs/AWSLogs/877912432675/elasticloadbalancing/us-east-1/2018/12/13 . --recursive) and ran the following script:

import os
import gzip

cwd = os.getcwd()
logs_dir = os.path.join(cwd, 'rc_logs')
filenames = os.listdir(logs_dir)

full_filepaths = [os.path.join(logs_dir, f) for f in filenames]
only_files = [f for f in full_filepaths
              if os.path.isfile(f) and ('resource-contracts-lb' in f or 'rc-subsite-master-lb' in f)]

# Concatenate the gunzipped ALB logs into a single file.
logs = os.path.join(cwd, 'final_log')
with open(logs, 'a') as target:
    for f in only_files:
        print(f)
        with gzip.open(f, 'rt') as zip_ref:
            target.write(zip_ref.read())

# Keep only production requests within the time window and reduce each log
# line to date, load balancer name and requested path.
curated_file_path = os.path.join(cwd, 'curated_sorted_lines_log.csv')
with open(logs, 'r') as f:
    with open(curated_file_path, 'w') as cf:
        for line in f:
            if 'staging' not in line:
                # Glue the HTTP method to the quoted request URL so that the line
                # splits into a predictable number of space-separated tokens.
                line = (line.replace('"GET ', 'GET_')
                            .replace('"POST ', 'POST_')
                            .replace('"HEAD ', 'HEAD_')
                            .replace('"OPTIONS ', 'OPTIONS_'))
                tokens = line.split(' ')
                date = tokens[1]
                if '2018-12-13T14:59:59.140792Z' < date < '2018-12-13T16:00:00.140792Z':
                    system = tokens[2]
                    path = tokens[12]
                    cf.write('{},{},{}\n'.format(date, system, path))

The resulting file was then imported into Google Drive.
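
As a side note, the Top 5 can also be computed straight from the curated CSV without the spreadsheet. A minimal sketch (column layout as produced by the script above; collapsing query-string variants into a single endpoint count is an assumption):

import csv
from collections import Counter
from urllib.parse import urlsplit

counts = Counter()
with open('curated_sorted_lines_log.csv') as cf:
    for row in csv.reader(cf):
        # The third column looks like 'GET_https://host/path?query'; strip the
        # method prefix and the query string so variants of one endpoint are
        # counted together.
        url = row[2].split('_', 1)[-1]
        counts[urlsplit(url).path] += 1

for path, count in counts.most_common(5):
    print(count, path)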

adinuca commented Dec 14, 2018

On the 13th, between 1 and 2 am, the following requests were made:
[screenshot 2018-12-14 at 16 38 45]

adinuca commented Dec 14, 2018

On the 13th of December, between 3pm and 4pm, the following requests were made:
[screenshot 2018-12-14 at 16 47 30]

adinuca commented Dec 24, 2018

On the 23rd of December, between 5pm and 6pm, the following requests were made:
[screenshot 2018-12-24 at 09 39 38]
[screenshot 2018-12-24 at 09 05 16]

Data has been gathered here: https://docs.google.com/spreadsheets/d/1mhHmy6n9m3PMY-BLzLhDXz3jFL6FECRJ8sW5oE09S18/edit#gid=296562686&fvid=1368567753

adinuca commented Jan 7, 2019

  • For every page viewed on the RC subsites, one request is made to retrieve the list of contracts corresponding to the viewed page, and another 25 requests are made to RC API to retrieve the details of each contract. When the documents on the page are re-sorted, the process repeats. (A rough fan-out estimate follows after this list.)
  • For each contract viewed, 3 requests are made to RC API to retrieve data about the contract.
    The requests are made using the open-contract-id, not the id of the contract as is done for the search page. The requests fetch the annotations, the metadata and the text.
  • For each /contract/contractID/metadata request, there are 3-4 requests made to ES.
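
Putting these numbers together, a back-of-envelope sketch of the fan-out per subsite page view (the assumption that each of the 25 detail requests triggers one metadata lookup is mine, not measured):

# Rough request fan-out per RC subsite page view (sketch, not a measurement).
LISTING_REQUESTS = 1     # one request for the page's contract list
DETAIL_REQUESTS = 25     # one RC API request per contract shown on the page
ES_PER_METADATA = 3.5    # 3-4 Elasticsearch requests per metadata call

rc_api_requests = LISTING_REQUESTS + DETAIL_REQUESTS
# Assumption: every detail request results in one metadata lookup.
es_requests = DETAIL_REQUESTS * ES_PER_METADATA

print(rc_api_requests)   # 26 RC API requests per page view
print(es_requests)       # 87.5, i.e. roughly 88 Elasticsearch requests per page view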

adinuca commented Jan 7, 2019

#1172, #1173 and #1174 were raised to address these findings.

adinuca closed this as completed Jan 7, 2019