Performance analysis #23

Closed
jpmckinney opened this issue Oct 2, 2020 · 11 comments

jpmckinney commented Oct 2, 2020

This analysis addresses issues from CRM-6536; data.json is attached to that issue.

I ran:

pip install memory_profiler matplotlib
time mprof run libcoveoc4ids data.json
mprof plot

time's output is:

Executed in   55.89 secs   fish           external 
   usr time   43.71 secs  791.77 millis   42.92 secs 
   sys time    3.10 secs  373.29 millis    2.73 secs 

mprof plot output:

[Image: Screen Shot 2020-10-02 at 12 06 18 PM]

I'm going to use cProfile to identify which methods to add the @profile decorator to:

python -m cProfile -o code.prof libcoveoc4ids/cli/__main__.py -o /dev/null data.json
gprof2dot -f pstats code.prof | dot -Tpng -o output.png
open output.png

As documented at https://ocp-software-handbook.readthedocs.io/en/latest/python/performance.html
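
The same kind of cProfile run can also be done from inside Python, which is handy when only one call needs profiling. The following is a minimal sketch, not the workflow used in this issue: `busy_work` and `profile_top` are hypothetical names, and the workload is a stand-in for the real entry point.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Stand-in workload (hypothetical); replace with the real entry point."""
    return sum(i * i for i in range(n))


def profile_top(func, *args, limit=10):
    """Run func under cProfile; return its result and a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_top(busy_work, 100_000)
    print(report)
```

The functions at the top of the cumulative-time listing are the natural candidates for the @profile decorator.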

@jpmckinney

After #24, most time is spent in common_checks_context from lib-cove, so it is not an issue for this repo to fix.

[Image: output]

@jpmckinney

Just loading the 50 MB JSON file takes a lot of memory (about 205 MiB):

Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     3   12.109 MiB   12.109 MiB   @profile
     4                             def main():
     5   12.109 MiB    0.000 MiB       with open("data.json") as f:
     6  217.312 MiB  205.203 MiB           json.load(f)
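
A similar measurement can be sketched with only the standard library. This is an illustration under assumptions: it uses `tracemalloc` rather than memory_profiler, so it counts Python-level allocations instead of process memory, and `measure_load` and the synthetic document are hypothetical, not the data.json from this issue.

```python
import json
import tracemalloc


def measure_load(text):
    """Return (data, peak_bytes) for parsing a JSON string with the stdlib json module."""
    tracemalloc.start()
    try:
        data = json.loads(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return data, peak


if __name__ == "__main__":
    # Synthetic document (an assumption; not the file profiled above).
    text = json.dumps([{"id": i, "title": f"Project {i}"} for i in range(50_000)])
    data, peak = measure_load(text)
    print(f"serialized: {len(text) / 2**20:.1f} MiB, "
          f"peak while parsing: {peak / 2**20:.1f} MiB")
```

Comparing the serialized size with the peak gives a rough idea of how much larger the parsed Python objects are than the text they came from.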

jpmckinney commented Oct 2, 2020

python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json

Filename: libcoveoc4ids/cli/__main__.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   53.062 MiB   53.062 MiB   @profile
     9                             def main():
    10   53.062 MiB    0.000 MiB       parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
    11   53.062 MiB    0.000 MiB       parser.add_argument("filename")
    12   53.062 MiB    0.000 MiB       parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
    13                             
    14   53.070 MiB    0.008 MiB       args = parser.parse_args()
    15                             
    16   53.074 MiB    0.004 MiB       cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
    17   53.074 MiB    0.000 MiB       try:
    18   53.074 MiB    0.000 MiB           result = oc4ids_json_output(
    19   53.074 MiB    0.000 MiB               cove_temp_folder,
    20   53.074 MiB    0.000 MiB               args.filename,
    21  326.312 MiB  273.238 MiB               file_type='json'
    22                                     )
    23                                 finally:
    24  326.316 MiB    0.004 MiB           shutil.rmtree(cove_temp_folder)
    25                             
    26  326.316 MiB    0.000 MiB       kwargs = {}
    27  326.316 MiB    0.000 MiB       if not args.compact:
    28                                     if using_orjson:
    29                                         kwargs['option'] = jsonlib.OPT_INDENT_2
    30                                     else:
    31                                         kwargs['indent'] = 2
    32                             
    33  380.668 MiB   54.352 MiB       output = jsonlib.dumps(result, **kwargs)
    34                             
    35  380.668 MiB    0.000 MiB       if using_orjson:
    36  468.285 MiB   87.617 MiB           output = output.decode('utf-8')
    37                             
    38  518.566 MiB   50.281 MiB       print(output)

jpmckinney commented Oct 2, 2020

Added an -o option to eliminate the time and memory spent decoding and printing the output:

[Image: output]

Before (when using orjson, bytes need to be decoded to string to be printed):

[Image: Screen Shot 2020-10-02 at 3 10 43 PM]

After:

[Image: Screen Shot 2020-10-02 at 3 10 11 PM]

python -m memory_profiler libcoveoc4ids/cli/__main__.py --compact data.json

Filename: libcoveoc4ids/cli/__main__.py

Line #    Mem usage    Increment   Line Contents
================================================
     8   53.109 MiB   53.109 MiB   @profile
     9                             def main():
    10   53.113 MiB    0.004 MiB       parser = argparse.ArgumentParser(description='Lib Cove OC4IDS CLI')
    11   53.113 MiB    0.000 MiB       parser.add_argument("filename")
    12   53.113 MiB    0.000 MiB       parser.add_argument("-c", "--compact", action="store_true", help="compact instead of pretty-printed output")
    13   53.113 MiB    0.000 MiB       parser.add_argument("-o", "--output", help="write output to the given file instead of standard output")
    14                             
    15   53.121 MiB    0.008 MiB       args = parser.parse_args()
    16                             
    17   53.121 MiB    0.000 MiB       cove_temp_folder = tempfile.mkdtemp(prefix='lib-cove-oc4ids-cli-', dir=tempfile.gettempdir())
    18   53.121 MiB    0.000 MiB       try:
    19   53.121 MiB    0.000 MiB           result = oc4ids_json_output(
    20   53.121 MiB    0.000 MiB               cove_temp_folder,
    21   53.121 MiB    0.000 MiB               args.filename,
    22  320.234 MiB  267.113 MiB               file_type='json'
    23                                     )
    24                                 finally:
    25  320.305 MiB    0.070 MiB           shutil.rmtree(cove_temp_folder)
    26                             
    27  320.305 MiB    0.000 MiB       kwargs = {}
    28                             
    29  320.305 MiB    0.000 MiB       if using_orjson:
    30  320.305 MiB    0.000 MiB           kwargs['option'] = 0
    31  320.305 MiB    0.000 MiB           if not args.compact:
    32                                         kwargs['option'] |= jsonlib.OPT_INDENT_2
    33  320.305 MiB    0.000 MiB           if args.output:
    34  320.305 MiB    0.000 MiB               kwargs['option'] |= jsonlib.OPT_APPEND_NEWLINE
    35                             
    36  375.785 MiB   55.480 MiB           output = jsonlib.dumps(result, **kwargs)
    37  375.785 MiB    0.000 MiB           if args.output:
    38  375.785 MiB    0.000 MiB               with open(args.output, 'wb') as f:
    39  375.785 MiB    0.000 MiB                   f.write(output)
    40                                     else:
    41                                         print(output.decode())
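
The idea behind the -o option can be sketched on its own: when a path is given, the serialized bytes go straight to a file opened in binary mode, so no decoded string copy of the output is created. This is a simplified sketch, not the CLI's actual code: `serialize` and `write_output` are hypothetical helpers, and the standard-library fallback is an assumption added so the example runs without orjson installed.

```python
import json

try:
    import orjson  # optional dependency, as in the profiled CLI
except ImportError:
    orjson = None


def serialize(result, compact=False, to_file=False):
    """Serialize result to bytes, pretty-printed unless compact is true."""
    if orjson is not None:
        option = 0
        if not compact:
            option |= orjson.OPT_INDENT_2
        if to_file:
            option |= orjson.OPT_APPEND_NEWLINE
        return orjson.dumps(result, option=option)
    # Fallback (an assumption for this sketch): stdlib json, encoded to UTF-8.
    text = json.dumps(result, indent=None if compact else 2)
    if to_file:
        text += "\n"
    return text.encode("utf-8")


def write_output(result, path=None, compact=False):
    """Write bytes straight to path, or decode and print when no path is given."""
    output = serialize(result, compact=compact, to_file=path is not None)
    if path:
        with open(path, "wb") as f:
            f.write(output)  # bytes are written as-is, with no decode step
    else:
        print(output.decode("utf-8"))
```

Calling `write_output(result, path="out.json")` skips both the decode and the print, which are the two steps that added memory in the earlier profile.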

jpmckinney commented Oct 2, 2020

After OpenDataServices/lib-cove#60, get_json_data_generic_paths goes from 10% to 3% of the total time in the profile:

[Image: output]

@jpmckinney

After OpenDataServices/lib-cove#61 and OpenDataServices/lib-cove#62, get_fields_present_with_examples takes less time:

[Image: output]

@jpmckinney

Now focusing on jsonschema. Fixing is_type yields:

[Image: output]

@jpmckinney

The script has a baseline of 53 MiB, jsonlib.loads adds 229 MiB, and common_checks_oc4ids adds the rest (about 45 MiB, for a total of 327 MiB).

Filename: /Users/james/Sites/remote/open-contracting/lib-cove-oc4ids/libcoveoc4ids/api.py

Line #    Mem usage    Increment   Line Contents
================================================
    23   53.164 MiB   53.164 MiB   @profile
    24                             def oc4ids_json_output(output_dir, file, file_type=None, json_data=None,
    25                                                    lib_cove_oc4ids_config=None):
    26                             
    27   53.164 MiB    0.000 MiB       if not lib_cove_oc4ids_config:
    28   53.164 MiB    0.000 MiB           lib_cove_oc4ids_config = LibCoveOC4IDSConfig()
    29                             
    30   53.164 MiB    0.000 MiB       if not file_type:
    31                                     file_type = get_file_type(file)
    32   53.164 MiB    0.000 MiB       context = {"file_type": file_type}
    33                             
    34   53.164 MiB    0.000 MiB       if file_type == 'json':
    35   53.164 MiB    0.000 MiB           if not json_data:
    36   53.164 MiB    0.000 MiB               if using_orjson:
    37   53.164 MiB    0.000 MiB                   kwargs = {'mode': 'rb'}
    38                                         else:
    39                                             kwargs = {'encoding': 'utf-8'}
    40   53.164 MiB    0.000 MiB               with open(file, **kwargs) as fp:
    41   53.164 MiB    0.000 MiB                   try:
    42   53.164 MiB    0.000 MiB                       if using_orjson:
    43  282.121 MiB  228.957 MiB                           json_data = jsonlib.loads(fp.read())
    44                                                 else:
    45                                                     json_data = jsonlib.load(fp)
    46                                             except ValueError:
    47                                                 raise APIException('The file looks like invalid json')
    48                             
    49  282.125 MiB    0.004 MiB           schema_oc4ids = SchemaOC4IDS(lib_cove_oc4ids_config=lib_cove_oc4ids_config)
    50                             
    51                                 else:
    52                             
    53                                     raise Exception("JSON only for now, sorry!")
    54                             
    55  282.125 MiB    0.000 MiB       context = common_checks_oc4ids(
    56  282.125 MiB    0.000 MiB           context,
    57  282.125 MiB    0.000 MiB           output_dir,
    58  282.125 MiB    0.000 MiB           json_data,
    59  282.125 MiB    0.000 MiB           schema_oc4ids,
    60  327.008 MiB  327.008 MiB           lib_cove_oc4ids_config=lib_cove_oc4ids_config)
    61                             
    62  327.008 MiB    0.000 MiB       return context

Filename: /Users/james/Sites/remote/opendataservices/lib-cove/libcove/lib/common.py

Line #    Mem usage    Increment   Line Contents
================================================
   637  285.719 MiB  285.719 MiB   @profile
   638                             def get_schema_validation_errors(
   639                                 json_data,
   640                                 schema_obj,
   641                                 schema_name,
   642                                 cell_src_map,
   643                                 heading_src_map,
   644                                 extra_checkers=None,
   645                             ):
   646  285.719 MiB    0.000 MiB       pkg_schema_obj = schema_obj.get_pkg_schema_obj()
   647                             
   648  285.719 MiB    0.000 MiB       validation_errors = collections.defaultdict(list)
   649  285.719 MiB    0.000 MiB       format_checker = FormatChecker()
   650  285.719 MiB    0.000 MiB       if extra_checkers:
   651                                     format_checker.checkers.update(extra_checkers)
   652                             
   653  285.719 MiB    0.000 MiB       if getattr(schema_obj, "extended", None):
   654                                     resolver = CustomRefResolver(
   655                                         "",
   656                                         pkg_schema_obj,
   657                                         config=schema_obj.config,
   658                                         schema_url=schema_obj.schema_host,
   659                                         schema_file=schema_obj.extended_schema_file,
   660                                         file_schema_name=schema_obj.schema_name,
   661                                     )
   662                                 else:
   663  285.719 MiB    0.000 MiB           resolver = CustomRefResolver(
   664  285.719 MiB    0.000 MiB               "",
   665  285.719 MiB    0.000 MiB               pkg_schema_obj,
   666  285.719 MiB    0.000 MiB               config=schema_obj.config,
   667  285.719 MiB    0.000 MiB               schema_url=schema_obj.schema_host,
   668                                     )
   669                             
   670  285.719 MiB    0.000 MiB       our_validator = validator(
   671  285.719 MiB    0.000 MiB           pkg_schema_obj, format_checker=format_checker, resolver=resolver
   672                                 )
   673  308.219 MiB    0.902 MiB       for e in our_validator.iter_errors(json_data):
   674  308.219 MiB    0.004 MiB           message_safe = None
   675  308.219 MiB    0.000 MiB           message = e.message
   676  308.219 MiB    0.004 MiB           path = "/".join(str(item) for item in e.path)
   677  308.219 MiB    0.000 MiB           path_no_number = "/".join(
   678  308.219 MiB    0.004 MiB               str(item) for item in e.path if not isinstance(item, int)
   679                                     )
   680                             
   681  308.219 MiB    0.000 MiB           value = {"path": path}
   682  308.219 MiB    0.000 MiB           cell_reference = cell_src_map.get(path)
   683                             
   684  308.219 MiB    0.000 MiB           if cell_reference:
   685                                         first_reference = cell_reference[0]
   686                                         if len(first_reference) == 4:
   687                                             (
   688                                                 value["sheet"],
   689                                                 value["col_alpha"],
   690                                                 value["row_number"],
   691                                                 value["header"],
   692                                             ) = first_reference
   693                                         if len(first_reference) == 2:
   694                                             value["sheet"], value["row_number"] = first_reference
   695                             
   696  308.219 MiB    0.000 MiB           header = value.get("header")
   697                             
   698  308.219 MiB    0.000 MiB           header_extra = None
   699  308.219 MiB    0.000 MiB           pre_header = ""
   700                             
   701  308.219 MiB    0.000 MiB           if not header and len(e.path):
   702  308.219 MiB    0.000 MiB               header = e.path[-1]
   703  308.219 MiB    0.000 MiB               if isinstance(e.path[-1], int) and len(e.path) >= 2:
   704                                             # We're dealing with elements in an array of items at this point
   705  300.898 MiB    0.000 MiB                   pre_header = "Array Element "
   706  300.898 MiB    0.004 MiB                   header_extra = "{}/[number]".format(e.path[-2])
   707                             
   708  308.219 MiB    0.000 MiB           null_clause = ""
   709  308.219 MiB    0.000 MiB           validator_type = e.validator
   710  308.219 MiB    0.000 MiB           if e.validator in ("format", "type"):
   711  308.219 MiB    0.000 MiB               validator_type = e.validator_value
   712  308.219 MiB    0.000 MiB               if isinstance(e.validator_value, list):
   713  308.219 MiB    0.000 MiB                   validator_type = e.validator_value[0]
   714  308.219 MiB    0.000 MiB                   if "null" not in e.validator_value:
   715  308.219 MiB    0.000 MiB                       null_clause = "is not null, and"
   716                                         else:
   717  308.219 MiB    0.000 MiB                   null_clause = "is not null, and"
   718                             
   719  308.219 MiB    0.000 MiB               message_template = validation_error_template_lookup.get(
   720  308.219 MiB    0.000 MiB                   validator_type, message
   721                                         )
   722  308.219 MiB    0.000 MiB               message_safe_template = validation_error_template_lookup_safe.get(
   723  308.219 MiB    0.000 MiB                   validator_type
   724                                         )
   725                             
   726  308.219 MiB    0.000 MiB               if message_template:
   727  308.219 MiB    0.004 MiB                   message = message_template.format(pre_header, header, null_clause)
   728                             
   729  308.219 MiB    0.000 MiB               if message_safe_template:
   730  308.219 MiB    0.000 MiB                   message_safe = format_html(
   731  308.219 MiB    0.008 MiB                       message_safe_template, pre_header, header, null_clause
   732                                             )
   733                             
   734  308.219 MiB    0.004 MiB           if e.validator == "oneOf" and e.validator_value[0] == {"format": "date-time"}:
   735                                         # Give a nice date related error message for 360Giving date `oneOf`s.
   736                                         message = validation_error_template_lookup["date-time"]
   737                                         message_safe = format_html(
   738                                             validation_error_template_lookup_safe["date-time"]
   739                                         )
   740                                         validator_type = "date-time"
   741                             
   742  308.219 MiB    0.000 MiB           if not isinstance(e.instance, (dict, list)):
   743  308.219 MiB    0.000 MiB               value["value"] = e.instance
   744                             
   745  308.219 MiB    0.000 MiB           if e.validator == "required":
   746  300.898 MiB    0.000 MiB               field_name = e.message
   747  300.898 MiB    0.000 MiB               parent_name = None
   748  300.898 MiB    0.000 MiB               if len(e.path) > 2:
   749  300.898 MiB    0.000 MiB                   if isinstance(e.path[-1], int):
   750  300.898 MiB    0.000 MiB                       parent_name = e.path[-2]
   751                                             else:
   752                                                 parent_name = e.path[-1]
   753                             
   754  300.898 MiB    0.000 MiB               heading = heading_src_map.get(path_no_number + "/" + e.message)
   755  300.898 MiB    0.000 MiB               if heading:
   756                                             field_name = heading[0][1]
   757                                             value["header"] = heading[0][1]
   758  300.898 MiB    0.000 MiB               header = field_name
   759  300.898 MiB    0.000 MiB               if parent_name:
   760  300.898 MiB    0.000 MiB                   message = "'{}' is missing but required within '{}'".format(
   761  300.898 MiB    0.004 MiB                       field_name, parent_name
   762                                             )
   763  300.898 MiB    0.004 MiB                   message_safe = format_html(
   764  300.898 MiB    0.000 MiB                       "<code>{}</code> is missing but required within <code>{}</code>",
   765  300.898 MiB    0.000 MiB                       field_name,
   766  300.898 MiB    0.008 MiB                       parent_name,
   767                                             )
   768                                         else:
   769                                             message = "'{}' is missing but required".format(field_name)
   770                                             message_safe = format_html(
   771                                                 "<code>{}</code> is missing but required", field_name, parent_name
   772                                             )
   773                             
   774  308.219 MiB    0.000 MiB           if e.validator == "enum":
   775  307.074 MiB    0.000 MiB               if "isCodelist" in e.schema:
   776                                             continue
   777  307.074 MiB    0.000 MiB               message = "Invalid code found in '{}'".format(header)
   778  307.074 MiB    0.004 MiB               message_safe = format_html("Invalid code found in <code>{}</code>", header)
   779                             
   780  308.219 MiB    0.000 MiB           if e.validator == "pattern":
   781                                         message_safe = format_html(
   782                                             "<code>{}</code> does not match the regex <code>{}</code>",
   783                                             header,
   784                                             e.validator_value,
   785                                         )
   786                             
   787  308.219 MiB    0.000 MiB           if e.validator == "minItems" and e.validator_value == 1:
   788                                         message_safe = format_html(
   789                                             "<code>{}</code> is too short. You must supply at least one value, or remove the item entirely (unless it’s required).",
   790                                             e.instance,
   791                                         )
   792                             
   793  308.219 MiB    0.000 MiB           if e.validator == "minLength" and e.validator_value == 1:
   794                                         message_safe = format_html(
   795                                             '<code>"{}"</code> is too short. Strings must be at least one character. This error typically indicates a missing value.',
   796                                             e.instance,
   797                                         )
   798                             
   799  308.219 MiB    0.000 MiB           if message_safe is None:
   800  308.219 MiB    0.000 MiB               message_safe = escape(message)
   801                             
   802  308.219 MiB    0.000 MiB           if header_extra is None:
   803  308.219 MiB    0.000 MiB               header_extra = header
   804                             
   805                                     unique_validator_key = {
   806  308.219 MiB    0.000 MiB               "message": message,
   807  308.219 MiB    0.000 MiB               "message_safe": conditional_escape(message_safe),
   808  308.219 MiB    0.000 MiB               "validator": e.validator,
   809  308.219 MiB    0.000 MiB               "assumption": e.assumption if hasattr(e, "assumption") else None,
   810                                         # Don't pass this value for 'enum' and 'required' validators,
   811                                         # because it is not needed, and it will mean less grouping, which
   812                                         # we don't want.
   813                                         "validator_value": e.validator_value
   814  308.219 MiB    0.000 MiB               if e.validator not in ["enum", "required"]
   815  307.074 MiB    0.000 MiB               else None,
   816  308.219 MiB    0.000 MiB               "message_type": validator_type,
   817  308.219 MiB    0.000 MiB               "path_no_number": path_no_number,
   818  308.219 MiB    0.000 MiB               "header": header,
   819  308.219 MiB    0.000 MiB               "header_extra": header_extra,
   820  308.219 MiB    0.000 MiB               "null_clause": null_clause,
   821  308.219 MiB    0.000 MiB               "error_id": e.error_id if hasattr(e, "error_id") else None,
   822                                     }
   823  308.219 MiB    0.031 MiB           validation_errors[json.dumps(unique_validator_key)].append(value)
   824  308.219 MiB    0.000 MiB       return dict(validation_errors)
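
The last lines of the profiled function group errors by serializing a dict of shared attributes with json.dumps and using that string as the key of a defaultdict. A stripped-down sketch of that grouping technique follows; `group_errors` and its plain-dict error records are hypothetical simplifications, not lib-cove's API.

```python
import collections
import json


def group_errors(errors):
    """Group error locations under a JSON-serialized key of their shared attributes."""
    grouped = collections.defaultdict(list)
    for error in errors:
        key = {
            "message": error["message"],
            "validator": error["validator"],
            "path_no_number": error["path_no_number"],
        }
        # A dict is unhashable, so serialize it to a string to use as a dict key.
        grouped[json.dumps(key)].append({"path": error["path"]})
    return dict(grouped)
```

Errors that differ only in their array index share a key, so the result holds one entry per kind of error with a list of the places it occurred.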

@jpmckinney

time python libcoveoc4ids/cli/__main__.py data.json > /dev/null

Got time down to:

Executed in   26.67 secs   fish           external 
   usr time   23.59 secs  227.00 micros   23.59 secs 
   sys time    0.81 secs  1608.00 micros    0.81 secs 

Originally:

________________________________________________________
Executed in   43.23 secs   fish           external 
   usr time   39.75 secs  211.00 micros   39.75 secs 
   sys time    1.10 secs  1505.00 micros    1.10 secs 
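
For reference, the two wall-clock figures above can be compared as follows. This is just arithmetic on the reported numbers; `improvement` is a hypothetical helper.

```python
def improvement(before_secs, after_secs):
    """Return (fractional reduction in time, speedup factor)."""
    reduction = (before_secs - after_secs) / before_secs
    speedup = before_secs / after_secs
    return reduction, speedup


reduction, speedup = improvement(43.23, 26.67)
print(f"{reduction:.1%} less wall-clock time, {speedup:.2f}x as fast")
# prints "38.3% less wall-clock time, 1.62x as fast"
```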

@jpmckinney

jpmckinney commented Oct 6, 2020

Closing this issue. Memory usage was halved, and running time is two-thirds faster. In the comment above, oc4ids_json_output ended at 327 MB: it started at 53 MB and used 229 MB just to load the JSON, which leaves 45 MB for lib-cove. That is small compared with the memory needed for the JSON data itself.
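
The memory breakdown can be checked against the unrounded figures from the memory_profiler output for oc4ids_json_output. The variable names here are only for illustration.

```python
total_mib = 327.008     # memory after common_checks_oc4ids returns
baseline_mib = 53.164   # memory on entry to oc4ids_json_output
json_mib = 228.957      # increment reported for jsonlib.loads

remaining_mib = total_mib - baseline_mib - json_mib
print(f"{remaining_mib:.1f} MiB attributable to the checks")
# prints "44.9 MiB attributable to the checks"
```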

PRs are open in https://github.com/OpenDataServices/lib-cove/pulls/jpmckinney

Other follow-up issues are:
