Code.gov Data Quality Scoring

As part of our efforts to make Code.gov as useful as possible to our users, we have implemented rules to score the quality of the data being indexed into Code.gov. This score is not a judgement or critique of the data itself. It is our attempt to quantify how much of the information our users look for and need, on our site and through our API, is present.

Score Determination

We determine the quality score of a repository by passing it through a rules engine that applies a series of rules. We also assign each field a point value that reflects how valuable that field is to Code.gov.

The rules vary from field to field. For every field, we check that data exists before adding the field's weight to the overall score. For the description field, we also check that the content is not the same as the project name, and we make a "naive" evaluation of how much content the field has (a word count). Other fields are evaluated according to the data that is expected and desired in them.
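The description rule can be sketched in a few lines. This is a minimal illustration, not the actual Code.gov rule: the function name `score_description` is hypothetical, the point thresholds are the ones listed for description in the Field Points table, and comparing the name and description case-insensitively is an assumption.

```python
def score_description(name: str, description: str) -> int:
    """Return 0-10 points for a description (hypothetical helper)."""
    text = (description or "").strip()
    project_name = (name or "").strip()

    # Missing data earns nothing; neither does a description that only
    # repeats the project name (case-insensitive match is an assumption).
    if not text or text.lower() == project_name.lower():
        return 0

    # "Naive" content check: count whitespace-separated words.
    word_count = len(text.split())
    if word_count < 3:
        return 0
    if word_count < 10:
        return 3
    if word_count < 20:
        return 5
    if word_count < 30:
        return 8
    return 10
```

For example, a five-word description such as "A small command line tool" would earn 3 of the 10 possible points under this sketch.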

Field Points

We give point values between 0 and 10. Each value is the maximum score that data in the field can earn. For some fields the data is evaluated and may be awarded fewer points than the maximum shown in the following table. Fields like tags may also be awarded partial points based on the number of items.

| Field Name | Max Field Points | Required | Notes |
| --- | --- | --- | --- |
| name | 10 | yes | |
| version | 1 | yes | |
| organization | 5 | | |
| description | 10 | yes | 0 points if description and name are the same, 0 points if description is fewer than 3 words, 3 points if description is fewer than 10 words, 5 points if description is fewer than 20 words, 8 points if description is fewer than 30 words, 10 (full) points if description is 30 or more words |
| permissions.licenses.URL | 1 | | Licenses is an array of objects and it should have at least 1 element |
| permissions.licenses.name | 7 | | Licenses is an array of objects and it should have at least 1 element |
| permissions.usageType | 10 | yes | 1 point if exempt*, 5 points if governmentWideReuse, 10 (full) points if openSource |
| permissions.exemptionText | 5 | | 5 (full) points if usageType is openSource/governmentWideReuse OR exemptionText is present, 0 otherwise |
| tags | 10 | yes | Tags is an array, 4 points if 1 tag element, 6 points if 2 tag elements, 10 (full) points if 3 or more tag elements |
| contact.email | 10 | yes | |
| contact.name | 5 | | |
| contact.URL | 5 | | |
| contact.phone | 5 | | |
| status | 5 | | |
| vcs | 5 | | |
| repositoryURL | 10 | yes | 10 (full) points if valid URL (starts with https:// or http://) |
| homepageURL | 7 | | must exist and be a valid URL to get full points |
| downloadURL | 3 | | must exist and be a valid URL to get full points |
| disclaimerURL | 3 | | must exist and be a valid URL to get full points |
| disclaimerText | 3 | | |
| languages | 10 | | Languages is an array of strings, must have at least 1 element to get full points |
| laborHours | 10 | yes | Full points if a positive numeric value |
| relatedCode.name | 1 | | |
| relatedCode.URL | 1 | | must exist and be a valid URL to get full points |
| relatedCode.isGovernmentRepo | 1 | | |
| reusedCode.name | 1 | | |
| reusedCode.URL | 1 | | must exist and be a valid URL to get full points |
| partners.name | 1 | | |
| partners.email | 1 | | |
| date.created | 5 | | |
| date.lastModified | 5 | | |
| date.metadataLastUpdated | 2 | | |
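A few of the partial-credit notes in the table can be illustrated with a rough sketch. The function names are hypothetical and this is not the rules engine's own code. It assumes that `exempt*` means any usageType value beginning with "exempt", that an unrecognized usageType earns 0 points, and that the "starts with https:// or http://" check stated for repositoryURL is also what "valid URL" means for the other URL fields.

```python
def score_tags(tags) -> int:
    """Partial credit for tags: 4 points for 1, 6 for 2, 10 for 3 or more."""
    count = len(tags or [])
    if count >= 3:
        return 10
    return {0: 0, 1: 4, 2: 6}[count]


def score_usage_type(usage_type) -> int:
    """Points for permissions.usageType."""
    if usage_type == "openSource":
        return 10
    if usage_type == "governmentWideReuse":
        return 5
    # Assumption: "exempt*" is a prefix wildcard over the exemption values.
    if isinstance(usage_type, str) and usage_type.startswith("exempt"):
        return 1
    # Assumption: missing or unrecognized values earn nothing.
    return 0


def score_url(url, max_points: int) -> int:
    """Full points for a URL that starts with https:// or http://, else 0."""
    if isinstance(url, str) and url.startswith(("https://", "http://")):
        return max_points
    return 0
```

Passing the field's maximum into `score_url` lets the same check serve repositoryURL (10 points), homepageURL (7 points), and the 1-point and 3-point URL fields.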

Field Rules

We are using the simple-rules-engine package to evaluate the rules we've created for each field. Our current rules can be seen here.

Score Normalization

The final score is normalized to a value between 0 and 10. To arrive at this value, the points awarded for each field in the table above are summed, divided by the maximum possible number of points, and then multiplied by 10.
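As a hedged sketch of that arithmetic, the hypothetical function `normalize_score` below takes the points awarded per field and the maximum points per field and returns a value between 0 and 10. The dictionary-based inputs are an assumption made for illustration, not the shape the Code.gov code actually uses.

```python
def normalize_score(awarded: dict, max_points: dict) -> float:
    """Scale awarded points to 0-10: sum(awarded) / sum(max) * 10."""
    total_possible = sum(max_points.values())
    if total_possible == 0:
        # Guard against an empty rule set; returning 0 here is an assumption.
        return 0.0
    # Fields with no awarded entry count as 0 points.
    total_awarded = sum(awarded.get(field, 0) for field in max_points)
    return total_awarded * 10 / total_possible


# Illustrative subset of fields, not the full table.
max_points = {"name": 10, "description": 10, "tags": 10, "laborHours": 10}
awarded = {"name": 10, "description": 8, "tags": 6}
score = normalize_score(awarded, max_points)  # 24 of 40 points
```

In this example 24 of 40 possible points are awarded, so the normalized score is 24 / 40 × 10 = 6.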

Score Display

The code.gov front end uses a corner tag to display the data quality score. This is implemented by the quality-tag component. The tag type/color is determined by the score value passed to the component using the following ranges:

  • low/red: a score below 4
  • medium/orange: a score greater than or equal to 4 and less than 6
  • high/green: a score greater than or equal to 6
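The ranges above amount to a simple threshold lookup. The sketch below is illustrative only: `quality_tag` and `TAG_COLORS` are hypothetical names, and this is not the code of the quality-tag component itself.

```python
# Tag type to color, as listed in the ranges above.
TAG_COLORS = {"low": "red", "medium": "orange", "high": "green"}


def quality_tag(score: float) -> str:
    """Map a normalized 0-10 score to a corner tag type."""
    if score < 4:
        return "low"
    if score < 6:
        return "medium"
    return "high"
```

A score of exactly 4 falls in the medium range and a score of exactly 6 falls in the high range, matching the "greater than or equal to" boundaries.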

Note: because all code.gov projects should have the required fields filled in, the majority of projects will have the medium/orange data quality corner tag.

Check out our Metadata Examples document for additional information on metadata to include in order to raise your data quality score.

For more info on the code.gov corner tags see the style guide.

See quality_tag.js to view the code for this corner tag component logic.