remove per capita stop words #4456

chejennifer · 2024-07-10T15:27:13Z

same change as #4415 but rebased off a clean master

svindex diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_21_55_14.html

base does not remove per capita stop words
test removes per capita stop words

chejennifer · 2024-07-10T16:35:18Z

...test_data/e2e_edge_cases2/povertyvs.unemploymentrateindistrictsoftamilnadu/chart_config.json

@@ -31,7 +31,6 @@
              }
            ],
            "denom": "Count_Person",
-            "startWithDenom": true,


I think we lose these startWithDenom settings because phrases like "unemployment rate", "mortality rate", etc. are no longer classified as PerCapita. The change to PerCapita stop words replaced the plain "rate" trigger with a regex that matches "rate" only when it is not part of "unemployment rate", "mortality rate", etc.

I wonder if we need some special treatment here, where we still classify these phrases as PerCapita but don't remove those specific stop words.

This then feels like an improvement, right?

The original query: "poverty vs. unemployment rate" isn't a per-capita query, so it makes sense to not start with per-capita enabled?
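To make the "rate" discussion above concrete, here is a minimal sketch of a trigger regex that fires on "rate"/"rates" unless it directly follows a qualifier such as "unemployment" or "mortality". The pattern, the `_EXCLUDED_QUALIFIERS` list, and `is_per_capita_trigger` are hypothetical names for illustration; this is not the regex in the PR.

```python
import re

# Hypothetical qualifier list: "<qualifier> rate" should NOT count as a
# per-capita trigger. The real list in the PR may differ.
_EXCLUDED_QUALIFIERS = ("unemployment", "mortality")

# One fixed-width negative lookbehind per qualifier (Python's re module
# requires each lookbehind to have a fixed width).
_LOOKBEHINDS = "".join(f"(?<!{re.escape(w)} )" for w in _EXCLUDED_QUALIFIERS)

# Match "rate" or "rates" as a whole word, unless preceded by a qualifier.
_RATE_RE = re.compile(_LOOKBEHINDS + r"\brates?\b", re.IGNORECASE)


def is_per_capita_trigger(query: str) -> bool:
    """Returns True if the query has a bare 'rate(s)' per-capita trigger."""
    return _RATE_RE.search(query) is not None
```

Under this sketch, "poverty vs. unemployment rate" does not trigger, while "crime rate in california" does, which matches the behavior described for the golden above.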

chejennifer · 2024-07-11T18:44:27Z

updated changes:

add alternate sv description with just murder for svs about murders & non-negligent murders
remove "heart attack" from stroke description
update PerCapita classification for "rates" to use same regex as "rate"
only remove PerCapita stop words for toolformer mode

sv diffs for index updates: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_11_11_41_44.html

sv diffs for index updates AND removal of per capita stop words: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_11_09_43_43.html

base is old index and KEEP per capita stop words
test is new index and REMOVE per capita stop words

diffs look ok to me

pradh · 2024-07-12T16:04:18Z

Awesome! How did you generate the 2nd diffs? For future reference.

pradh

Thanks for the edits!

pradh · 2024-07-12T16:00:53Z

server/lib/nl/detection/variable.py

+  # TODO: decouple words removal from detected attributes. Today, the removal
+  # blanket removes anything that matches, including the various attribute/
+  # classification triggers and contained_in place types (and their plurals).
+  # This may not always be the best thing to do.


Move this to combine_stop_words function def site?

pradh · 2024-07-12T16:02:53Z

shared/lib/constants.py


-# We do not want to strip words from events / superlatives / temporal
+# We do not want to strip words from events / superlatives / temporal / percapita
 # since we want those to match SVs too!


Could we add a comment here so we remember why we retain PerCapita words in stop-words in main DC, perhaps with an example?
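As a rough illustration of the mechanism under discussion, the sketch below builds a stop-word list from a base list plus the trigger words of every heuristic type that is not in a "keep" set, then strips those words from a query. All word lists and function names here are hypothetical stand-ins, not the contents of shared/lib/constants.py or the real signatures in shared utils, and the sketch deliberately does not say which mode uses which keep set.

```python
import re

# Hypothetical word lists, for illustration only.
BASE_STOP_WORDS = {"the", "of", "in", "what", "is"}
HEURISTIC_TRIGGER_WORDS = {
    "superlative": {"highest", "lowest"},
    "percapita": {"per capita", "per person"},
}

# Two possible "keep" sets: trigger words of these types stay in the query.
KEEP_WITH_PER_CAPITA = frozenset({"superlative", "percapita"})
KEEP_WITHOUT_PER_CAPITA = frozenset({"superlative"})


def combine_stop_words_sketch(keep_types):
    """Base stop words plus triggers of every heuristic type not kept."""
    words = set(BASE_STOP_WORDS)
    for heuristic_type, triggers in HEURISTIC_TRIGGER_WORDS.items():
        if heuristic_type not in keep_types:
            words |= triggers
    # Longest first, so multi-word phrases are removed before single words.
    return sorted(words, key=len, reverse=True)


def remove_stop_words_sketch(query, stop_words):
    """Removes each stop word (whole-word match) and collapses whitespace."""
    for word in stop_words:
        query = re.sub(rf"\b{re.escape(word)}\b", " ", query,
                       flags=re.IGNORECASE)
    return " ".join(query.split())
```

With the query "what is the income per capita in california", the first keep set leaves "income per capita california" and the second leaves "income california". A comment at the constant's definition could carry an example of this shape.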

chejennifer · 2024-07-12T16:27:29Z

updated diffs after merging master; they still look ok

just index diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_12_09_08_55.html

index + per capita change: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_12_09_23_39.html

chejennifer · 2024-07-12T16:30:53Z

Awesome! How did you generate the 2nd diffs? For future reference.

updated code in differ.py:

_STRIP_STOP_WORDS_NEW_FN = 'STRIP_STOP_WORDS_NEW'

flags.DEFINE_enum('test_query_transform', _EMPTY_FN, [
    _STRIP_STOP_WORDS_FN, _STRIP_STOP_WORDS_NO_EXCLUSION_FN, _EMPTY_FN,
    _STRIP_STOP_WORDS_NEW_FN
], 'Transform to perform on test query.')

_ALL_STOP_WORDS_NEW = shared_utils.combine_stop_words(
    shared_constants.HEURISTIC_TYPES_IN_VARIABLES_TOOLFORMER)

_QUERY_TRANSFORM_FUNCS: dict[str, Callable[[str], str]] = {
    _STRIP_STOP_WORDS_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS),
    _STRIP_STOP_WORDS_NO_EXCLUSION_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS, {}),
    _STRIP_STOP_WORDS_NEW_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS_NEW)
}

command to run the diff:

./run.sh base_uae_mem base_uae_mem --base_query_transform=STRIP_STOP_WORDS --test_query_transform=STRIP_STOP_WORDS_NEW --queryset=tools/nl/svindex_differ/queryset_vars_withstopwords.csv
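For future reference, the idea behind passing different base and test transforms can be sketched as follows: each side applies its own named transform to the same query, and only queries where the two results differ are worth inspecting. This covers the transform step only; the real differ goes on to compare SV matches from the embeddings index. The `strip_words` helper, the word sets, and `diff_transforms` are hypothetical and are not part of differ.py.

```python
from typing import Callable


def strip_words(words: set[str]) -> Callable[[str], str]:
    """Builds a transform that drops the given single-word tokens."""

    def _transform(query: str) -> str:
        return " ".join(t for t in query.split() if t.lower() not in words)

    return _transform


# Hypothetical transform registry, keyed by the flag value.
TRANSFORMS: dict[str, Callable[[str], str]] = {
    "BASE": strip_words({"the", "of", "per", "capita"}),
    "TEST": strip_words({"the", "of"}),
}


def diff_transforms(queries: list[str], base: str, test: str):
    """Returns (query, base_result, test_result) where the results differ."""
    diffs = []
    for query in queries:
        base_q = TRANSFORMS[base](query)
        test_q = TRANSFORMS[test](query)
        if base_q != test_q:
            diffs.append((query, base_q, test_q))
    return diffs
```

For example, `diff_transforms(["income per capita", "population of the usa"], "BASE", "TEST")` reports only the first query, since both transforms reduce the second to "population usa".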

* Added bard staging environment (#4359) - Added bard staging environment - Fixed bug in `gke/get_storage_permission.sh` where the project environment name was being used for the robot service account instead of the project_id. Before: `website-robot@bard_staging.iam.gserviceaccount.com` (inconsistent with the account created by `create_robot_account.sh` script . After : `[email protected]` * [rag eval] read eval type from sheet (#4363) * Show per capita when the query has "per person" (#4367) * [rag eval] add table pane component (#4370) https://github.com/datacommonsorg/website/assets/69875368/cf87a224-58c2-4a6c-ad2a-92ac1aa07e0a * Add block evaluation claim counter (#4366) * [rag eval] add concept of feedback stage (#4372) - Add a new FeedbackStage enum to be used when deciding what to display in the query section and feedback sections - Changes to make RIG eval tool work with this new concept - only change made for RAG was displaying the rag calls in the query section when feedback stage is the CALLS stage. Actual navigation for RAG will come later * DC website compare tool navigation support (#4365) With this, when clicking in one side of the iframe, the other iframe can be updated correspondingly based on the url path. 
* [eval] fix previous button bug (#4373) * Updated mixer for 6/20/2024 release (#4375) * Added updated sdg table and ilo tables (#4376) * Fix errors in curated stat var descriptions (#4379) sv diff report: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_20_22_01_06.html * [rag eval] Add navigation logic and get calls feedback working (#4377) https://github.com/datacommonsorg/website/assets/69875368/29977c28-21ad-4ebf-8bdd-7f6f8300b3ca TODO: add loading spinner & maybe look into why the feedback section is slow to update * Add cacheSVFormula to autopush.yaml (#4381) Some examples testing locally of calculations: Timeline: <img width="1512" alt="Screenshot 2024-06-21 at 9 23 44 AM" src="https://github.com/datacommonsorg/website/assets/77713883/02b0d6b8-bd7e-4d8c-bc44-e10d66d0803b"> Map: <img width="1512" alt="Screenshot 2024-06-21 at 9 23 51 AM" src="https://github.com/datacommonsorg/website/assets/77713883/3b27feeb-cacb-40c1-abc9-9c6875dae133"> * [rag eval] add rag ans section (#4380) https://github.com/datacommonsorg/website/assets/69875368/7334d32d-4042-41a8-aca0-c7a8ef75e62f * Avoid place fallback for mode=toolformer_* (#4374) Data Gemma use-cases expect exact place's results, and when we don't have data, don't try to do place fallback. 
**Before** ![image](https://github.com/datacommonsorg/website/assets/4375037/88d17463-fe5a-4f5e-ba2f-f713d59e1009) **After** ![image](https://github.com/datacommonsorg/website/assets/4375037/079f2217-d5d1-4b97-9dfb-60b547b607cd) * fixed caching issue for POST requests to /api/observations/series (#4382) Before: POST request for a stat var returns the the wrong result due to caching Requested: Count_Person <img width="1428" alt="Screenshot 2024-06-21 at 3 34 31 PM" src="https://github.com/datacommonsorg/website/assets/13766/67eb91c1-4ac7-4c7e-abfc-9c81b919678c"> Got: SDG variable <img width="1458" alt="Screenshot 2024-06-21 at 3 34 40 PM" src="https://github.com/datacommonsorg/website/assets/13766/1c2ea88c-de9a-45eb-96b3-ed4d11f61ff3"> After cache key correction: <img width="1240" alt="Screenshot 2024-06-21 at 3 38 24 PM" src="https://github.com/datacommonsorg/website/assets/13766/c5d9f1d4-6ad6-4eac-af39-4bbc903a7e57"> * updated apigee instructions (#4368) * Updated un staging tables (#4378) * Added redis cache layer for staging.datacommons.org (#4383) Note: after merging, update oncall docs with a step to clear redis after deploying to staging * Major re-structure of build-embeddings tool; Use multiple sv/topic with the same description now (#4371) Major changes: - Re-structure build embeddings script to use a few modularized functions from tools/nl/embeddings/utils.py - Make base and custom embeddings build scripts *almost* identical now. Further clean up is needed to merge them. - Use multiple sv/topic with the same description now. 
The diff can be inspected from _preindex.csv files - Updated all the embedding indexes based on the changes - svdiff report for base: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_20_00_09_46.html Minor changes: - Rename/Removed several non critical embedding indexes - Updated documentation and scripts TODO: - clean up all custom_dc path assumptions in nl_server, web_server(admin) and build_embedding_tool. - deprecate custom dc embedding build script following above. - add back tests for tools/nl/embeddings/utils.py. * [rag eval] some small bug fixes (#4385) - fix some table rendering issues - add stage indicator - update overall options for RAG to be: irrelevant, somewhat relevant, relevant - update dc response options for RAG to be: does not match question, matches question - handle when there are no dc calls * Fix html diff tool bug and https->http issue (#4384) * [rag eval] add claims counters (#4388) update claim counter page to have 2 sections w/ 3 counters each: 1. statistical claims - total claims - false claims - unique tables 3. inferred claims - total claims - false claims - unsubstantiated claims ![Screenshot 2024-06-24 at 3 07 33 PM](https://github.com/datacommonsorg/website/assets/69875368/de4bbced-8890-4f26-8e38-51e2b4942c32) * Keep only one build_embedding tool and move all custom DC logic/path logic to admin/html.py (#4389) Major changes: - Put most custom DC specific constants, path, logic into admin/html.py. As this is the "offline import manager", it makes sense to let it decide on all these params and pass to scripts in different stages. - Make build_embeddings.py a general tool that can be purely self contained. - Support catalog_dict and catalog_paths when reading catalog. - In custom DC docker, put catalog.yaml as the same path as the GKE deployment, to unify the paths. 
* custom dc script to load data and generate embeddings (#4268) Copy of import/simple/stats/run_stats.sh to - process data - generate NL embedings * Fork build trigger config and deployment script for custom DC autopush (#4386) - cloudbuild.deploy.yaml -> cloudbuild.push.yaml, which adds a step to update repo version and removes a step to build and push a custom DC image - deploy shell script -> build-and-deploy shell script, which pulls latest submodules and builds/pushes a new image before deploying. Once the new trigger flow is fully set up, the old yaml and script can be deleted. * [rag eval] add overall questions feedback section (#4390) add another eval stage for users to evaluate the questions asked by the LLM <img width="1779" alt="Screenshot 2024-06-24 at 5 33 32 PM" src="https://github.com/datacommonsorg/website/assets/69875368/77ed8486-4d0a-4b43-8118-78236ca008b1"> * Fixed issue when .removeAttribute called on a web component's convertArrayAttribute property (#4391) Fixes this error in UN staging site: <img width="1216" alt="Screenshot 2024-06-25 at 12 54 32 AM" src="https://github.com/datacommonsorg/website/assets/13766/26e7af44-4b2f-4dc9-a384-e6d0cb404f30"> Impacts lit elements with properties with decorator: ``` @property({ type: Array<string>, converter: convertArrayAttribute }) ``` When React removes a property from an element, it calls `<domElement>.removeAttribute`. This then calls `convertArrayAttribute(undefined)`, which was causing an exception. 
TODO: add webdriver unit test for this case * Fix diff page path input issue and cross origin error; No need to load maps api key in explore page (#4392) * Fix nl server build issue (#4394) * Added 'part' selectors to 'show metadata' link (#4393) Gives web component users ability to hide and style the 'show metadata' link * Updated clearcache tool logging (#4387) Before: ``` ./tools/clearcache/clear_prod.sh WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues. To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command. Updated property [core/project]. Fetching cluster endpoint and auth data. kubeconfig entry generated for website-us-central1. Defaulted container "website" out of: website, nl True WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues. To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command. Updated property [core/project]. Fetching cluster endpoint and auth data. kubeconfig entry generated for website-us-west1. Defaulted container "website" out of: website, nl True ``` After: ``` ./tools/clearcache/clear_prod.sh WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues. To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command. Updated property [core/project]. Fetching cluster endpoint and auth data. kubeconfig entry generated for website-us-central1. 
Defaulted container "website" out of: website, nl Clearing cache for datcom-website-prod/website-us-central1/us-central1, redis host 10.167.58.139: True WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues. To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command. Updated property [core/project]. Fetching cluster endpoint and auth data. kubeconfig entry generated for website-us-west1. Defaulted container "website" out of: website, nl Clearing cache for datcom-website-prod/website-us-west1/us-west1, redis host 10.158.101.59: True ``` * remove redirect for missing trailing slashes on explore links (#4396) the redirect from /explore landing pages caused a downgrade from https to http. also fix a few http:// links to docsite. follow up to #4392 * [Build Embeddings] Save index config, md5 and add test to check integrity (#4348) - Removed old preindex/duplicate files and keep all input files simply under 'input' directory. - Re-built all indexes to generate the index_config.yaml and md5sum.txt in the embeddings output gcs folder. * Use commit hashes from all of website, mixer, and import to label custom DC images (#4399) * [eval] Fix empty feedback recorded, table parsing (#4404) - fix bug with empty call response being saved to firestore when "apply to next" is selected - fix table parsing issue when header contains "-" * [rag eval] update feedback stage order (#4405) move overall question feedback to before individual question feedback * Support query transforms with SV differ (#4400) Diff for medium_ft with vs. without stop-words removal: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/shanth_medium_ft_2024_06_25_21_00_30.html The canonical query in this case has stop-words. 
* [rag eval] fix crashing question, add loading, fix answer showing instead of questions (#4407) - fix crashing when a query doesn't have an answer - add loading when fetching data for the feedback section - update reading either answer or questions in the LHS * Fix submodule treatment in custom DC build script (#4403) - Actually initialize submodules so that checkout + pull has an effect. - Make a temporary commit so that HEAD:import and HEAD:mixer are updated. * [rag eval] update wording in calls feedback, add another question to overall feedback (#4406) - update wording in calls feedback <img width="1766" alt="Screenshot 2024-06-26 at 3 17 43 PM" src="https://github.com/datacommonsorg/website/assets/69875368/3cb62970-e6c5-4a78-8a56-1bf014920028"> - add another question to overall feedback <img width="1755" alt="Screenshot 2024-06-26 at 3 17 23 PM" src="https://github.com/datacommonsorg/website/assets/69875368/b392481e-2d26-410d-aff4-2f0aae411b6e"> * Set up git email before creating temp commit (#4408) Without this extra line, `git commit` fails with an error about requiring identity. * Add a direct link to the required version of protoc (#4412) When I was onboarding, I found the versioning for protoc confusing, and it also took me a while to figure out where I could download the required version. Adding a direct link for convenience for future readers/contributors. * update goldens (#4411) * Updated unsdg staging tables (#4395) * [nl] prune topic vars for rig (#4409) Allow topics for rig but prune the svs so that we only keep the ones that were also separately detected on their own. Only keep pruned topics if there were no svs returned (which, I'm slightly conflicted if we need this piece) * Fix a bunch of stat var descriptions based on gemma eval results (#4413) - Add "carbon footprint" to greenhouse gas emissions description - Remove `(Non-Biogenic)` in GHG variables to be consistent - use "without" for No_HealthInsurance var descriptions. 
svdiff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_27_15_06_07.html * Support Incremental embeddings build (#4401) Load saved embeddings vector from catalog.yaml and only compute text that are not seen. Also: - Use a SentenceObject to hold dcid, text and vector - Added a bunch of tests * Make a shared nl requirements.txt that can be imported from nl server and nl tools (#4410) * Update about.html (#4369) Change "I" to "we" * [nodejs] update csv table header for toolformer_rag mode, add website commit in debug info (#4414) - update csv table headers for mode=toolformer_rag: - line chart: "label" -> "date" e.g., https://paste.googleplex.com/4545373437952000 - bar chart: "label" -> "variable" e.g., https://paste.googleplex.com/6171709243916288 - map_chart: "label" -> "place" - read website commit from environment and return it as part of debug_info * [eval] handle empty text processing (#4418) * Update RAG eval tool counter layout for easier interaction (#4420) ![image](https://github.com/datacommonsorg/website/assets/5951856/b69aab9d-7c6e-4660-90de-c68f4033b620) * [rag eval] make less sheets reads, fix more table parsing (#4422) - make one call to read mulitple rows instead of making one call per row - fix table parsing when header contains "," or "₂" * [nl] don't use default place for toolformer modes (#4419) * Update SDG topics (#4421) * Update NL embeddings eval playground (#4424) Remove golden stat vars and eval scores since these are not practical and reliable. Updated this into a playground that can upload sv descriptions and queries on the fly and check stat var matches. The matches can highlight stat vars that have override descriptions. ![image](https://github.com/datacommonsorg/website/assets/5951856/4c31af16-cfb0-4e11-ba4a-7e4dcee74dde) * Add e2e instructions of adding stat var descriptions (#4425) * Remove "total" from "total population" and some other stat var descriptions. 
(#4423) Consolidated and cleaned up some other stat vars. Also updated svdiff tool README. https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_28_20_09_29.html * minor fixes to nl scripts (#4426) * Update diffs after staging push (#4427) * Update drive to work stat var descriptions (#4431) Also fixed some doc issues. svdiff report: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_17_51_44.html * [Stop Words] Introduce stop words exception list and add "how many", "number of" to this list (#4416) The larger embeddings model can understand semantics better (instead of token match). So we should try to preserve stop words. Including stop words would be a big change and we should do this in a controlled way. This introduces an exception list that are related to stop words and should be kept. With this, the query sentence will be "number of asian" instead of "number asian". Turns out this boosts the matching and the score a lot! https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_13_41_51.html * [nodejs] add highlight tile handling (#4417) allow nodejs to return a tile result for highlight tiles when the mode is set to something that is not bard e.g., what is the percentage of urban population in Tamil Nadu? 
no mode set: https://paste.googleplex.com/5341628766355456 mode=toolformer_rig: https://paste.googleplex.com/5803309044858880 * [eval tool] fixes for feedback pane formatting, wrong answer showing up, some easy nits (#4428) * Add ILO topics (#4430) * Remove old versions of forked deploy script/config (#4429) - deploy_custom_dc_autopush.sh is superseded by build_and_deploy_custom_dc_autopush.sh - deploy.yaml is superseded by push.yaml * Support sv description yaml file; Convert base DC sdg topics csv into yaml (#4435) * [embedding playground] Support index selection; Improve performance (#4437) ![image](https://github.com/datacommonsorg/website/assets/5951856/6a49246a-225b-44ab-ba6f-f3b5c1149638) - Make sure each model box renders only once. The encode and match time is substantial for larger model. - Add checkbox for index selection - Keep only one "apply" button * skip a broken webdriver test (#4439) skipping broken webdriver test because looks like it is caused by somewhere upstream in mixer/data * [eval] set up components to be used by sxs evals (#4436) update components that will be shared with sxs evals to take props instead of using context * Test custom DC autopush homepage load after deploying (#4346) * Increased fulfill stat var results limit for undata nl indexes (#4441) - Increased stat vars returned from 50 to 200 - Updated check to allow for increased limit on the SDG index * Updated ILO embeddings (#4438) Updated iLO embedings to reflect stat var grouping changes from #4430. Test in explore page with URL params `dc=undata_dev` or `dc=undata_ilo`. 
Example: http://localhost:8080/explore#q=unemployment+in+the+usa&client=ui_query&dc=undata_dev <img width="1397" alt="Screenshot 2024-07-03 at 4 13 38 AM" src="https://github.com/datacommonsorg/website/assets/13766/32228e74-720b-4cc0-8bed-04e3674032f7"> * Updated un staging tables (#4442) - Removed sdg table since it's now in covered in `schema_2024_07_03_12_17_40` and `country_2024_07_03_10_18_47` - Updated ilo tables * Update 'prevalence' to 'proportion of' in stat var description (#4434) "prevalence" is a very strong word match and results in non-optimal matching and ranking. "proportion of" seems to be more general wording and works good for a range of query key words (see diff below). I have also tried "percentage" which don't really match prevalence query very well (maybe prevalence could refer to millionth and not really used in common with percentage?) sv diff report: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_03_18_26_13.html The two examples below show increased score of matching when asking in different ways. ![image](https://github.com/datacommonsorg/website/assets/5951856/3b181ca4-4caf-4826-940c-cacde9d10f32) ![image](https://github.com/datacommonsorg/website/assets/5951856/c9f67dcd-a948-4730-b77f-ea2887611ade) * Add back dc/topic/Asthma now the child sv order is fixed (#4445) svdiff report: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_04_11_32_19.html * [gemma sxs] create new app (#4440) - add plumbing for new sxs eval app at: /eval/retrieval_generation_sxs?sheetIdLeft=&sheetIdRight=&queryId= - re-use eval components to display the LHS and RHS <img width="1765" alt="Screenshot 2024-07-03 at 2 10 32 PM" src="https://github.com/datacommonsorg/website/assets/69875368/90369c8f-670b-450c-8009-959f6b52d58d"> * Allow duplicate dcid among input csv and yaml files (#4443) This is to allow alternative files. 
* enabled mixer rest api for unsdg project (#4446) This enables /v1 and /v2 endpoints like: https://unsdg.datacommons.org/v2/node?nodes=dc/g/UN&property=-%3E* which are used by the UN data site * [Eval UI] Trigger sign in on load (#4447) Also show some explanatory text while things are loading. * [toolformer] replace "residents" with "people" (#4449) example: https://screenshot.googleplex.com/C3QjzRnnzD9euVc sv index differ: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_08_12_11_23.html where test replaces residents with people * Fix build embeddings tool for local paths. (#4450) * [nodejs] add highlight field to highlight tile response (#4448) e.g., what is the percentage of urban population in Tamil Nadu? https://paste.googleplex.com/4946781240819712 * Fixed css bug that was hiding the stanford custom dc navbar (#4451) ## Before: ![Screenshot 2024-07-08 at 3 30 32 PM](https://github.com/datacommonsorg/website/assets/13766/c52d87de-ca2a-4230-b55b-20020eb17520) ## After: ![Screenshot 2024-07-08 at 3 31 43 PM](https://github.com/datacommonsorg/website/assets/13766/591f906c-d935-4d6c-b87a-5c55f357e8b5) * Set env variables to fix sentence transformer issues in cloud run. (#4460) * Added datagemma gke configuration (#4454) * remove negation from stop words (#4455) svindex differ: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_20_03_18.html base has all the stop words test has the negation stop words removed * Make base/sheets_svs.csv a proper csv (#4464) So it can be viewed as a [table on github](https://github.com/datacommonsorg/website/assets/4375037/6f355c58-89da-4708-99dc-eb1061ef771c) and loaded into BQ, etc. 
* [SxS Eval] Update how left/right are determined and add baseline eval type (#4458) Also: - Make all inputs to left/right picker available via context - Clean up some unused stuff - Factor out duplicated template section into its own component * Skip experimental import group for non-autopush (#4466) * sv index updates (#4457) - remove temperature, min/max temperature, houseless X scheduledCaste/scheduledTribe variables - add descriptions: include "children" for Count_Person_18OrLessYears_NoHealthInsurance, include "native american" for Count_Person_AmericanIndianOrAlaskaNativeAlone - replace Annual_Emissions_CarbonDioxide_Biogenic with Annual_Emissions_CarbonDioxide_NonBiogenic svindex diffs: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_10_10_15_51.html * Move custom DC tests out of server/webdriver/tests (#4465) Otherwise they get run as part of ./run_test.sh -w * remove per capita stop words (#4456) same change as https://github.com/datacommonsorg/website/pull/4415 but rebased off a clean master svindex diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_21_55_14.html base does not remove per capita stop words test removes per capita stop words * Add support for testing bad-words file (#4462) Instructions in cl/651193865 * NL: Drop custom cleanup logic for SV titles (#4471) Fork of an old version of https://github.com/datacommonsorg/website/pull/4461/files. * [Eval SxS] Feedback footer with navigation; Firestore writes (#4468) - Add a feedback component and style it as a footer. - Add previous/next buttons that navigate between queries. - Write to Firestore when navigating previous/next. Just write static values for now. Also fix randomization of left/right. 
* [toolformer] replace "global" with "world" in queries in toolformer mode (#4475) ran goldens without the `if (params.is_toolformer_mode(dargs.mode))` & there were no diffs * [Eval SxS] Actually get and set ratings; Improve layout (#4472) - Add radio buttons for preference and a text area for reason - Fetch and show existing ratings, correcting left/right orientation if necessary - Save rating when navigating if it has been updated - Make panes scroll independently - Display query ID and question only once - Make footer take up only as much room as it should [Screencast](https://github.com/user-attachments/assets/1f7cb506-eacf-408a-8219-84b9be59165c) * add women in parliament to nl index (#4476) no diffs: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_12_16_30_34.html * Added GKE setup script to grant Vertex AI User role to robot/service account (#4479) * Avoid a buggy regex for comparison heuristic (#4477) This was added in very very early days in https://github.com/datacommonsorg/website/pull/2155, for comparison between two variables. In practice, "compar.." or "vs" are the trigger words we use. There are too many potential unknown matches with "...er$". Even comparative words like "better" or "greater" need not necessarily mean comparison across vars ("which california county has a greater chance of ...?"). So lets just drop it! There are no diffs in integration-tests. TODO: I hope to check the screenshot diffs after submitting this PR... ![image](https://github.com/user-attachments/assets/f46a6e8c-74fb-46b9-a432-7e73ebd54d75) * update goldens after topic cache update (#4478) commute time topic updated with mean commute time replacing total commute time * Add CDC SV explorer sanity test. 
(#4481) * [gemma eval tools] filter out empty tables in table pane (#4484) * [toolformer] promote svs from a topic for rag (#4482) Promote svs that are in a topic but show up immediately after that topic to before the topic * Update submods. (#4489) * Load custom dc embeddings at server startup. (#4473) * Allow subset of pattern exclusions from heuristics (#4487) Currently, words like "youngest" etc are added to stop-words. Add a set of exception patterns per heuristic type. Start off with toolformer-only. * [toolformer] exclude correlation fulfiller (#4486) * Add Custom DC NL sanity test. (#4485) * Show DC stats in tooltips for RIG (#4483) For RIG answers: - Don't show inline DC stats. Instead show DC stat with label in a tooltip when an LLM stat is hovered over. - When an LLM stat in SxS UI doesn't have an associated DC stat, don't highlight it or show a tooltip. (For regular eval UI, still highlight but don't add a tooltip.) Other changes: - Change "Why?" free-text form field label to "Comments (optional)" - Show "Loading answer..." when fetching answers, so answers are never out of sync with the feedback form. - Fetch all call data when initially loading each spreadsheet - Don't limit some sheet calls by column (fetch more data including possibly extraneous data in exchange for making fewer fetches) - Minor naming updates for readability [Screencast](https://github.com/user-attachments/assets/a7ca0359-ee0c-40ff-99eb-b332cae61bff) * avoid stripping period for "St." (#4490) There are place names with St. like St. Landry Parish and stripping the period from that will prevent the place from being detected. * Add caching to place pages (#4480) Adds caching to the place pages in preparation for SEO experimentation. Because the place pages don't get updated often, I did not set a timeout. The cache will refresh whenever we clear the cache as part of a website release. 
* [timeline] fix ratio bug (#4493) there was a bug in picking the denominator observation where when comparing if next denom is better, it was comparing with the earliest denominator & not the previous denominator. This caused weird behavior for per capita timelines autopush: https://screenshot.googleplex.com/6gh9Nyjr7kg638C local: https://screenshot.googleplex.com/BHUsQPMih5X27Hy * Added API handler for fetching observation dates given a list of entities and variables (#4474) Added API handler for fetching observation dates given a list of entities and variables. New website API endpoint: `/api/observation-dates/entities`. This endpoint is similar to the `/api/observation-dates` endpoint, but it takes a list of entities instead of a parentEntity and childType. This endpoint will support updating the timeline slider web component to accept a list of entities rather than parent/child relationships. Example usage: ``` GET /api/observation-dates/entities?entities=country/USA&entities=country/CAN&variables=Count_Person&variables=Count_Household Response: { "datesByVariable": [ { "variable": "Count_Person", "observationDates": [ { "date": "1900", "entityCount": [ { "facet": "2176550201", "count": 1 } ] }, { "date": "1901", "entityCount": [ { "facet": "2176550201", "count": 1 } ] } ] }, { "variable": "Count_Household", "observationDates": [ { "date": "1900", "entityCount": [ { "facet": "2176550202", "count": 1 } ] }, { "date": "1901", "entityCount": [ { "facet": "2176550202", "count": 1 } ] } ] } ], "facets": { "2176550201": {"importName": "facet1"}, "2176550202": {"importName": "facet2"} } } ``` * Add frozendict to nl_server (#4494) https://github.com/datacommonsorg/website/pull/4487 introduced frozendict import in shared lib. The NL server requirements was missing it ([Web server requirements does](https://github.com/datacommonsorg/website/blob/e868bcc8a374fb72106d52cf7fd55a9d8a58f7bb/server/requirements.txt#L9)). 
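As an illustration of the `/api/observation-dates/entities` endpoint from #4474 above, here is a small client-side sketch. The endpoint path, the repeated `entities` and `variables` query parameters, and the response field names come from the example in that entry; the helper names and the `api_root` argument are assumptions made for this sketch:

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_observation_dates_url(entities: List[str],
                                variables: List[str],
                                api_root: str = "") -> str:
    """Build the request URL, repeating each query parameter once per value."""
    params = [("entities", e) for e in entities]
    params += [("variables", v) for v in variables]
    # safe="/" keeps DCIDs such as "country/USA" readable, as in the example.
    return f"{api_root}/api/observation-dates/entities?{urlencode(params, safe='/')}"


def dates_by_variable(response: dict) -> Dict[str, List[str]]:
    """Map each variable in the response to its list of observation dates."""
    return {
        item["variable"]: [obs["date"] for obs in item["observationDates"]]
        for item in response["datesByVariable"]
    }


url = build_observation_dates_url(["country/USA", "country/CAN"],
                                  ["Count_Person", "Count_Household"])
print(url)
```

The second helper only reads the `datesByVariable` section; the `facets` map in the response would be looked up separately by facet ID.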
* Update README.md * [Eval UI] Handle invalid query IDs; clean up after async useEffects (#4495) * [Eval Sxs] Add an eval list; jump to first incomplete eval on load (#4491) Also keep answers in loading state until answer text is processed. [Demo](https://github.com/user-attachments/assets/ac31ea02-8646-4e32-8a59-12cc85178627) * update nl index for electricity generation & age ranges (#4492) - update descriptions with "electricity generated" -> "electricity generation" to be consistent with descriptions about "energy generation" - add child, adult, and senior age ranges to the index sv index diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_17_14_22_29.html * update compose for sentence transformer * [evals] use value instead of stringValue when getting cell content (#4498) cell values of type number wasn't fetched when using getCell().stringValue * [SxS] only show tables that are used in answer (#4497) for RAG answers, the table pane should only show tables actually used in the answer autopush: https://screenshot.googleplex.com/J5CqEUu87ptoKCd local: https://screenshot.googleplex.com/8ygR9UTWk7HrCu6 * [CDC Autopush] Reorganize build, deploy, and test scripts (#4499) For ease of debugging workflow failures, I want to try having each script be its own build step. 
The new build config YAML will be:

```
steps:
  - id: clone-website-repo
    name: gcr.io/cloud-builders/git
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        mkdir src
        git clone https://github.com/datacommonsorg/website.git src
        cd src
    waitFor: ['-']
  - id: build-and-tag-latest
    name: gcr.io/cloud-builders/docker
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        ./scripts/build_custom_dc_and_tag_latest.sh
    waitFor: ['clone-website-repo']
  - id: deploy-latest-to-autopush
    name: gcr.io/cloud-builders/gcloud
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        ./scripts/deploy_custom_dc_latest_to_autopush.sh
    waitFor: ['build-and-tag-latest']
  - id: run-tests
    name: python:3.11.3
    entrypoint: bash
    args:
      - -c
      - |
        ./run_test.sh --setup_python
        ./scripts/run_cdc_tests.sh
    waitFor: ['deploy-latest-to-autopush']
options:
  machineType: 'E2_HIGHCPU_32'
```

* [evals] fix format of values read from sheet (#4501) use formattedValue instead of value to get the actual value that a user sees in the sheet <img width="1757" alt="Screenshot 2024-07-18 at 1 08 36 PM" src="https://github.com/user-attachments/assets/81b83cc2-bfc6-45d8-a836-65a3b8959852"> * [CDC Autopush] Make new scripts executable (#4502) I always forget this! * [tiles] fix fraction digits in highlight tile (#4503) bug was that all the highlight tile values were showing 1 digit after the decimal. This was because: - we always defaulted numFractionDigits to 1 - before PR https://github.com/datacommonsorg/website/pull/4417 we actually weren't setting numFractionDigits field correctly in the highlight tile & this change here caused us to set numFractionDigits correctly, but for default cases that meant setting it to 1: https://github.com/datacommonsorg/website/pull/4417/files#diff-a7d41f3b232c03e93d2230c267b7465828c783406c2be4a4998e0b9e68df1d31R212 To fix this, - default numFractionDigits to undefined. 
This is safe in the other tiles because highlight tile is the only tile that uses numFractionDigits screenshot diff that found this bug: https://screenshot.googleplex.com/4yiW2b8d6WGPZ9f localhost: https://screenshot.googleplex.com/9cYvM2Xdx4ydo6C * Fixed csv download and show metadata link in local and custom DC to use local API path (#4463) Before: <img width="1407" alt="Screenshot 2024-07-10 at 5 17 34 PM" src="https://github.com/datacommonsorg/website/assets/13766/254eadbf-8ab5-463f-83e1-d01844be8eab"> <img width="1405" alt="Screenshot 2024-07-10 at 5 17 24 PM" src="https://github.com/datacommonsorg/website/assets/13766/7a596423-5edd-4bee-aa70-771d6c02d0c7"> After: <img width="1413" alt="Screenshot 2024-07-10 at 5 16 27 PM" src="https://github.com/datacommonsorg/website/assets/13766/b5b5943f-5a94-4dfc-9498-0676b36970a8"> <img width="1414" alt="Screenshot 2024-07-10 at 5 16 38 PM" src="https://github.com/datacommonsorg/website/assets/13766/dfa93d41-6b00-4782-97c2-ce815f7b883f"> * [nl] Fix PC stop words regex (#4505) fix the regex used for detecting rate/rates - dropped the word boundary because both classification detection and stop word removal (both cases where the stop words are used) add their own word boundary & having the word boundary in the stop word regex causes problems - use negative lookbehind instead of negative lookahead because we should be looking for all cases of rate that are not preceded by special words like literacy or mortality * Add ESLint warning for missing return type. (#4496) Codacy flags this but I would like to find out about it before waiting for checks to run. 
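To make the regex reasoning in #4505 above concrete, here is a hedged sketch of the two points it describes: the stop-word pattern itself carries no word boundary because callers add their own, and protected phrases are excluded with a negative lookbehind. The prefix list, the `wrap` helper, and the function names are hypothetical; the actual pattern in the NL code may differ:

```python
import re

# Hypothetical words that make "rate" part of a variable name
# rather than a per-capita cue.
PROTECTED_PREFIXES = ["literacy", "mortality"]

# One negative lookbehind per prefix: Python's `re` requires fixed-width
# lookbehinds, so prefixes of different lengths cannot share one group.
RATE_PATTERN = "".join(
    f"(?<!{re.escape(word)} )" for word in PROTECTED_PREFIXES) + "rates?"


def wrap(pattern: str) -> "re.Pattern[str]":
    """Add the word boundaries that the callers are said to apply themselves."""
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def is_per_capita(query: str) -> bool:
    """Classification: does the query contain an unprotected rate/rates?"""
    return wrap(RATE_PATTERN).search(query) is not None


def remove_stop_words(query: str) -> str:
    """Stop-word removal: drop unprotected rate/rates and tidy whitespace."""
    return " ".join(wrap(RATE_PATTERN).sub("", query).split())


# A negative lookahead placed before "rate" inspects the text that follows,
# so it cannot exclude "literacy rate":
lookahead_hit = re.search(r"\b(?!literacy )rates?\b", "literacy rate")
print(is_per_capita("crime rate in california"), lookahead_hit is not None)
```

The lookahead variant still matches "literacy rate", which is why the lookbehind form is the one that expresses "rate not preceded by a special word".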
* [Eval UI] Reformat RIG tooltip; Close list modal on esc or external click (#4507) [Demo](https://github.com/user-attachments/assets/0623d7af-3415-427e-bcc9-02410d15fbed) * update goldens after mixer staging push (#4506) * update submodules for release (#4510) * [CDC Autopush] Fix submod pull no-op case (#4511) Allow an empty commit when pulling in latest submods in case they're already up-to-date. This should fix current autopush build failures. * Consolidate CDC env file. (#4509) * Make OUTPUT_DIR required. (#4512) * Headless drivers are now blocked (#4504) * Create initial data docker image. (#4515) * Update bad words python test (#4516) This PR removes the assertion that in the bad words test, in `multi` mode, lines should not contain spaces. This assertion breaks when creating cross products that include place names like "united states of america" or "united kingdom". This PR also updates the "headless drivers" NL test case to look for the correct detection type. This should unblock the python and nl test failures seen in unrelated, already approved PRs. * Update nodejs query test goldens (#4514) Updates the goldens for our nodejs_query_differ test. Looks like there's been some data updates since we last updated these goldens. * [SEO Experiment] Add plumbing to read experiment pages from GCS (#4500) To run our SEO experiments, this PR: * Adds logic to the place pages for the server to read a jinja template from GCS. * Starts a folder in server config to hold experiment template files. * Checks in a sample template for Egypt with matching CSS. * Adds a script to sync GCS bucket contents with local config folder. Note: Templates for the other places in the experiment group will come in a follow up PR. Until those pages are completed, reading from GCS is only enabled in autopush, dev, and locally (i.e. not staging and not prod). 
![Screenshot 2024-07-18 at 12 32 54 PM](https://github.com/user-attachments/assets/8a637f23-6665-4119-b2a4-bb1a2c8f8fb3) * Remove *_env.list files. (#4517) * Add absolute path comments in env.list. (#4518) * Update .dockerignore to fix docker build (#4519) Updates .dockerignore to exclude the sanity.py file (in **/tests) to fix a cron-testing docker build failure. * Added support for HIGHEST_COVERAGE to /api/observation/point endpoints when specifying a list of entities and variables (#4513) Adds support for `date=HIGHEST_COVERAGE` to the `/api/observations/point` and `/api/observations/point/all` endpoints. Previously, `date=HIGHEST_COVERAGE` was only supported on `/api/observations/point/within` and `/api/observations/point/within/all` Requesting `date=HIGHEST_COVERAGE` uses a heuristic to fetch a recent data point (within 5 years or in the 5 most recent points, whichever is greater). When multiple variables are specified, we apply the same heuristic across using the total observation count across all variables for a given date to find the highest coverage. Example usage: Multi-entity, single variable `/api/observations/point?entities=country/RUS&entities=country/USA&entities=country/MEX&variables=Count_Person_InLaborForce&date=HIGHEST_COVERAGE` Multi-entity, multi-variable: `/api/observations/point?entities=country/RUS&entities=country/USA&entities=country/MEX&variables=Count_Person_InLaborForce&variables=sdg/SI_POV_DAY1&date=HIGHEST_COVERAGE` * update copd description in nl index (#4521) sv diffs (no diffs found): https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_25_16_22_31.html * Update mixer submodule (#4522) Updates submodules as part of website release process. Because last release was not so long ago, only mixer needed to be updated. 
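The `HIGHEST_COVERAGE` heuristic described in #4513 above can be sketched roughly as follows. This is an interpretation under stated assumptions, not the server code: dates are assumed to start with a four-digit year, "within 5 years" is measured from the latest available date, and ties are broken in favor of the more recent date. All function names are hypothetical:

```python
from typing import Dict, Optional


def total_counts(per_variable: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum observation counts per date across all requested variables."""
    totals: Dict[str, int] = {}
    for counts in per_variable.values():
        for date, count in counts.items():
            totals[date] = totals.get(date, 0) + count
    return totals


def highest_coverage_date(counts: Dict[str, int],
                          window_years: int = 5,
                          min_points: int = 5) -> Optional[str]:
    """Pick the best-covered date among the recent candidates."""
    if not counts:
        return None
    dates = sorted(counts, reverse=True)  # most recent first
    latest_year = int(dates[0][:4])
    # Candidate set 1: dates within the year window of the latest date.
    within = [d for d in dates if latest_year - int(d[:4]) < window_years]
    # Candidate set 2: the N most recent dates, however old they are.
    recent = dates[:min_points]
    candidates = within if len(within) > len(recent) else recent
    # Highest count wins; the more recent date wins a tie.
    return max(candidates, key=lambda d: (counts[d], d))


combined = total_counts({
    "Count_Person_InLaborForce": {"2020": 1, "2021": 2},
    "sdg/SI_POV_DAY1": {"2021": 3},
})
print(highest_coverage_date(combined))
```

For a multi-variable request, the per-date totals from `total_counts` are fed into the same selection, which mirrors the entry's note about using the total observation count across all variables.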
* [CDC Autopush] Update scripts to accommodate data docker (#4520) - Split out submodule update into its own script so it can be run as a prerequisite step for both data and service docker build steps. - Make sure to get website hash before making a temp commit. - Write image label made of combined commit hashes to a temp file so other scripts can use it. - Rename service docker build and deploy scripts to distinguish them from data docker build and (eventually) deploy scripts. - Make some minor edits recommended by Shellcheck and shell-format VSCode extensions. This PR will temporarily break custom DC autopush until I update the compose autopush cloudbuild config in the deployment repo. Planned update: https://paste.googleplex.com/6181631964741632 * Update NL integration test goldens (#4523) Updates our NL integration test goldens via `./run_test.sh -g`. Per our release docs, our NL tests rely on staging mixer, so the goldens need to be updated after a mixer release to staging. These updated goldens reflect the data changes from this mixer commit: https://github.com/datacommonsorg/mixer/commit/75c03483a4e843bb77ad71c3e0a6694a2ff39dd0 * sv index updates to handle global mortality/death rates (#4524) - remove global from topic descriptions - mortality rate/death rate -> mortalities/deaths - remove "mortality rate" and "death rate" from Per Capita exclusions because the exclusion list is all forms of "x rate" that show up as sv descriptions in the index sv diffs (looks ok to me): https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_26_13_31_00.html * Fix missing overview tile on explore page (#4526) Adds back the Google Maps API to the explore page. Its previous removal was causing the overview tile in queries like "Tell me about [place]" not to render properly. 
Before: <img width="1336" alt="Screenshot 2024-07-29 at 11 36 36 AM" src="https://github.com/user-attachments/assets/6e86023d-87c4-4964-b7cf-fe9ffb3e97c0"> After: <img width="1332" alt="Screenshot 2024-07-29 at 11 36 19 AM" src="https://github.com/user-attachments/assets/9354ffa3-5a23-4c8d-b0f3-32f26b22f5bf"> * Update GlobalHealth topic description (#4529) - in PR https://github.com/datacommonsorg/website/pull/4524, removed "Global" from all topic descriptions, however this caused losses for queries like "Health+conditions+vs+median+age+in+Alameda+County" and "Most+common+medical+conditions+in+US" because GlobalHealth as a topic got ranked higher than Health and HealthConditions - here we revert that global change & instead replace the word "Global" with "World", which is a word that can not be overindexed because it would be removed from the query (it is a place) and we confirmed that "global" in the query does not prefer "world" from SV description sv diffs: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_29_15_21_46.html * Add "tell me about california" to screenshot tests (#4528) Adds https://datacommons.org/explore#q=tell%20me%20about%20california to the screenshot tests to catch any future regressions in the overview tile on the NL results page. This PR is a followup to the fix made in #4526 * update bard goldens (#4530) Update diffs after pushing most recent changes * [Eval UI] Tweak table parsing (#4531) - Require table divider to be at least three dashes long. This prevents sequences of 1 or 2 dashes in values from getting parsed as dividers. - Allow any character other than a pipe in a table header value. * Fix map tool stat var search with many places (#4534) Fixes b/356689537 where variable search in the new map tool when plotting countries on earth resulted in an error "Request line is too large". 
Since the error only reproduces when running the website server with Gunicorn, also added a mode to run_server.sh for ease of reproing. I've moved all request params to the request body, but if we want to only move places (maybe for the sake of analytics), I can modify this. * Updated nodejs goldens (#4532) Reference: https://github.com/datacommonsorg/website/tree/master/tools/nl/nodejs_query_differ#update-goldens * Fixed error when fetching HIGHEST_COVERAGE date for variables with no observation-dates (#4533) Fixes error in prod when fetching HIGHEST_COVERAGE point observations for variables with no observation-dates: https://datacommons.org/api/observations/point/within?parentEntity=country/HKG&variables=Count_Person&childType=AdministrativeArea1&date=HIGHEST_COVERAGE No observation-dates summary for parent=country/HKG childType=AdministrativeArea1 , variable=CountPerson: https://datacommons.org/api/observation-dates?parentEntity=country/HKG&childType=State&variable=Count_Person <img width="1035" alt="Screenshot 2024-07-31 at 11 45 27 PM" src="https://github.com/user-attachments/assets/dad2fe0c-efc7-4da5-a0a7-c93c0e15368f"> Error message when running locally: <img width="1335" alt="Screenshot 2024-07-31 at 11 41 35 PM" src="https://github.com/user-attachments/assets/94d85164-e181-46df-ada3-066d4f5a2fe9"> After fix: <img width="1265" alt="Screenshot 2024-07-31 at 11 43 22 PM" src="https://github.com/user-attachments/assets/2e929522-2ee6-4212-92cc-136e8b05df0f"> * Added datacommons-bar 'subscribe' event listener to handle date change events from datacommons-slider. (#4525) - Added datacommons-bar 'subscribe' event listener to handle date change events from datacommons-slider. - Updated error display for all components. 
## Bar chart slider integration Example usage: ``` <datacommons-bar apiRoot="http://localhost:8080" places="geoId/06 geoId/11 geoId/12" date="HIGHEST_COVERAGE" title="Life expectancy vs Median age in California, the District of Columbia, and Florida (${date})" subscribe="dc-bar" variables="LifeExpectancy_Person Median_Age_Person" > <div slot="footer"> <datacommons-slider apiRoot="http://localhost:8080" places="geoId/06 geoId/11 geoId/12" publish="dc-bar" variables="LifeExpectancy_Person Median_Age_Person" ></datacommons-slider> </div> </datacommons-bar> ``` <img width="1070" alt="Screenshot 2024-07-26 at 4 09 48 PM" src="https://github.com/user-attachments/assets/ebd78c05-6b21-464e-b8b9-7a407d80e616"> ## Error display update: ![MG3mrvpTrZLmP2V](https://github.com/user-attachments/assets/c4f9fbe4-f649-4903-8805-3a79dfa3f703) ![76Vd74zssCVhEcR](https://github.com/user-attachments/assets/e11658a2-0a5e-4a49-b270-331bda271ec7) * updated golden tests (#4535) * Updated mixer and import submodules. (#4536) * Update cache key for stat var search POSTs (#4537) Quick fix to follow up https://github.com/datacommonsorg/website/pull/4534 and unblock the website release. For the future it would be nice to wrap @cache.cached in a custom decorator that we can use everywhere which takes care of passing default params and making sure post body is in the cache key. * Fix misplaced comment. (#4538) * Update submods. (#4540) * remove world from mortality topic descriptions (#4541) "world" was being overindexed in queries like "global population". This is a short term fix. Longer term fix will be to have a set of negative variables that require a higher score threshold to be returned sv diffs: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_08_06_16_04_57.html * Create initial multi stage CDC services docker image. 
(#4542) * Added custom data commons docker-based local development environment (#4543) Start the docker environment by running: ``` ./run_cdc_dev_docker.sh ``` Open http://localhost:8080 in the browser Changes: - Replaced `USE_LOCAL_MIXER` environment variable with `WEBSITE_MIXER_API_ROOT` to specify the specific path of the local mixer. - Added `NL_SERVICE_ROOT_URL` optional environment variable to specify NL service path for the website - Updated NL app to serve on `0.0.0.0` instead of `127.0.0.1` to allow docker to expose NL service to other containers - Updated nl_requirements: `pandas` to `2.1.1` and `scikit-learn` to `1.3.1` because both of these versions come with pre-built wheels for python 3.11+ ([pandas wheels](https://www.piwheels.org/project/pandas/), [scikit-learn wheels](https://www.piwheels.org/project/scikit-learn/) * Update redirects.json (#4539) Adding two redirects for datacommons.org/link/video and datacommons.org/link/form for the two pager Co-authored-by: Dan Noble <[email protected]> * Update RIG tooltips to match latest mocks (#4544) - Incorporate footnote content into tooltip content and don't show footnotes in answer body - Also show tooltips when no DC stat is present https://github.com/user-attachments/assets/0a14aca2-615a-4e8a-a261-7e61b316948c * Use async when loading Google maps APIs (#4545) This PR adds `loading=async` to the call to load the Google Maps API, as per [Google's Documentation](https://developers.google.com/maps/documentation/javascript/overview#Loading_the_Maps_API). 
This removes the following console warning from pages with maps calls: ![Screenshot 2024-08-08 at 9 50 54 AM](https://github.com/user-attachments/assets/4e4ed399-c51d-4f44-80cc-3bc61a54ddbb) * Update RIG UI to show footnotes if no tooltips (#4546) If there are no tooltips shown, go back to old UI of showing footnotes <img width="2547" alt="Screenshot 2024-08-08 at 3 45 40 PM" src="https://github.com/user-attachments/assets/9ee09deb-cff5-4192-ada9-b3a395831db3"> * Create smaller, faster building services docker. (#4547) * Create scripts to auto build / deploy the new services docker. (#4548) * Copy only static dist artifacts to further reduce docker image size. (#4549) * Update internal to use remoteMixerDomain (#4550) Verified on https://dc.corp.goog/version * Build cdc services image with docker buildkit enabled. (#4551) * Run chmod as a separate step when building docker image. (#4553) * Fixed local cdc docker compose setup to use custom embeddings and configured a local data path (#4555) Fixes when running `./run_cdc_dev_docker.sh` : - Updated FLASK_ENV from local to custom to show the custom_dc homepage - Added IS_CUSTOM_DC true to load only custom dc embeddings - Added `OUTPUT_DIR` to env configuration, which: - Loads the custom topic cache in dc-website - Configures the `ADDITIONAL_CATALOG_PATH` (`custom_catalog.yaml`) which contains custom dc embeddings and index definitions - Mounts local sqlite database for the `dc-mixer` * Added custom data commons terraform deployment scripts (#4552) Introduces a Terraform-based deployment framework for setting up a custom Data Commons instance on GCP. The deployment automates the creation of necessary infrastructure, including Cloud Run services, Redis, and MySQL instances, and provisions essential API keys and secrets. It supports multiple instances in a single GCP account using namespaces and Terraform workspaces. ## Features: - Automates the deployment of a Data Commons website and data task containers via Cloud Run. 
- Optionally provisions a Redis instance for caching. - Creates a MySQL instance with a generated password stored securely in Secret Manager. - Automatically enables required Google Cloud APIs. - Supports multiple deployments in the same GCP project using namespaces. ## How to Use/Deploy: - Follow instructions in `deploy/terraform-custom-datacommons/README.md` * Add query logging for Bard instance (#4556) * Create initial client to fetch dc api keys for a given project. (#4557) * minor readme and .gitignore updates to custom dc terraform (#4558) Fixed some typos in the custom datacommons terraform readme, and added the backend.tf to gitignore * Update magic_eye to use remoteMixerDomain (#4559) Verified at https://datcom-magiceye-dev.corp.goog/version * Add apigee apis for importing keys. (#4560) * Read projects from Google Sheet and write dc keys back to it. (#4562) * Updated terraform.tfvars.sample with dc_api_key (#4564) * Updated comments in terraform.tfvars.sample (#4565) * Added disableEntityLink option to bar chart web component (#4563) (UN ask) Adds `disableEntityLink` option to bar chart web components to remove the entity link from x-axis: ![9Yy4WfAEkGGD5PY](https://github.com/user-attachments/assets/208c3066-9d19-4c92-908f-c8ef85d007e3) * Update README.md (#4567) * updated custom dc terraform services container to use the new image (#4570) * Add one more step to instructions (#4569) Co-authored-by: Dan Noble <[email protected]> * Add support for importing keys into apigee. (#4568) * Set default value for FLASK_ENV. (#4576) * updated custom dc docker dev image to pull down models in container build (#4575) * Bump github.com/hashicorp/go-getter from 1.7.0 to 1.7.5 in /deploy/terraform-datacommons-website/test (#4397) Bumps [github.com/hashicorp/go-getter](https://github.com/hashicorp/go-getter) from 1.7.0 to 1.7.5. 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Dan Noble <[email protected]> * [custom_dc] Add env.list.sample (#4577) * Removed 'Analyze this data in BigQuery' button from visualization tools (#4582) ## Before ![before1](https://github.com/user-attachments/assets/5757640c-4d90-44bb-825d-10c564c750cd) ![after2](https://github.com/user-attachments/assets/c57368f5-40db-4ef3-9f9f-0b1220d7924f) ## After ![after1](https://github.com/user-attachments/assets/6aa46825-8d27-4623-931b-6e79592f30ba) ![before2](https://github.com/user-attachments/assets/8256c28c-b683-4973-b839-21580e0b6b66) * Added terraform variable google_analytics_tag_id for enabling Google Analytics (#4572) - Added terraform variable `google_analytics_tag_id` for enabling Google Analytics - The Google Analytics Tag ID can now be passed as an environment variable to Custom Data Commons. Previously, users had to set the variable in their `server/app_env/*.py` config file * Enable NL by default for Custom DC. 
(#4583) * [docs] Update readme with setup instructions for tests. (#4578) Make it clear that devs have to run setup before running tests. Fixes #3923 * Disable obs browser pages for some custom DCs (#4566) Updates * climate_trace * custom * feedingamerica * iitm * unsdg This is to remove dependency on BQ (and is currently broken for many instances currently), so we don't have to keep as many BQ versions Verified locally for each instance, but wanted to double check that this won't negatively impact any prod instance? * Updating goldens for nodejs query differ (#4585) Responding to the alerts on autopush * Add semicolon to server script (#4571) * Remove scripts and docker files related to old CDC docker image. (#4590) * Stanford Upload 8/24 (#4586) Adding Sustainable Systems Lab upload to Stanford env --------- Co-authored-by: Bo Xu <[email protected]> Co-authored-by: Carolyn Au <[email protected]> * Reduced UN prod cluster resources (#4508) Metrics links: - [Requests per second](https://console.cloud.google.com/monitoring/metrics-explorer;duration=P14D?pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22dataSets%22:%5B%7B%22plotType%22:%22LINE%22,%22targetAxis%22:%22Y1%22,%22timeSeriesFilter%22:%7B%22aggregations%22:%5B%7B%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22groupByFields%22:%5B%22resource.label.%5C%22url_map_name%5C%22%22%5D,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%5D,%22apiSource%22:%22DEFAULT_CLOUD%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22filter%22:%22metric.type%3D%5C%22loadbalancing.googleapis.com%2Fhttps%2Frequest_count%5C%22%20resource.type%3D%5C%22https_lb_rule%5C%22%20resource.label.%5C%22project_id%5C%22%3D%5C%22datcom-recon-autopush%5C%22%20resource.label.%5C%22url_map_name%5C%22%3D%5C%22k8s2-um-35faua5t-website-website-ingress-2uz3r82p%5C%22%22,%22groupByFields%22:%5B%22resource.label.%5C%22url_map_name%5C%22%22%5D,%22minAlignmentPeriod%22:%2260s%22,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%7D%5D,%22options%22:%7B%22mode%22:%22C
OLOR%22%7D,%22y1Axis%22:%7B%22label%22:%22%22,%22scale%22:%22LINEAR%22%7D%7D%7D&project=datcom-recon-autopush&e=13803378&hl=en&inv=1&invt=AbXICg&mods=-monitoring_api_staging) - [Latency](https://console.cloud.google.com/monitoring/metrics-explorer;startTime=2024-07-09T04:02:03Z;endTime=2024-07-16T04:02:03Z?pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22d…

remove per capita words as stop words

aa6e5fa

chejennifer requested a review from pradh July 10, 2024 15:27

chejennifer added 2 commits July 10, 2024 08:37

Merge branch 'master' into boPC

228db8a

fix test

83599b9

chejennifer commented Jul 10, 2024

chejennifer added 4 commits July 11, 2024 09:08

Merge branch 'master' into boPC

bda18af

updates

c9953e0

fix tests

9fa2cfc

Merge branch 'master' into boPC

5823552

chejennifer added 2 commits July 11, 2024 11:44

Merge branch 'master' into boPC

457e79b

Merge branch 'master' into boPC

6829b31

pradh approved these changes Jul 12, 2024

update after merge and address comments

d743c12

Merge branch 'master' into boPC

7eb6a83

chejennifer merged commit 95e7096 into datacommonsorg:master Jul 12, 2024
9 checks passed
