remove per capita stop words #4456

Merged
merged 11 commits into from
Jul 12, 2024

chejennifer
Contributor

same change as #4415 but rebased off a clean master

svindex diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_21_55_14.html

base does not remove per capita stop words
test removes per capita stop words

@chejennifer chejennifer requested a review from pradh July 10, 2024 15:27
@@ -31,7 +31,6 @@
}
],
"denom": "Count_Person",
"startWithDenom": true,
Contributor Author

@chejennifer chejennifer Jul 10, 2024

I think we lose these startWithDenom entries because queries like "unemployment rate", "mortality rate", etc. are no longer classified as PerCapita. The PerCapita stop-word change replaced the plain "rate" trigger with a regex that matches "rate" only when it is not part of "unemployment rate", "mortality rate", etc.

I wonder if we need some special treatment here where we still classify these things as PerCapita, but don't remove those specific stop words

Contributor

This then feels like an improvement right?

The original query: "poverty vs. unemployment rate" isn't a per-capita query, so it makes sense to not start with per-capita enabled?
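
As a rough illustration of the classification change discussed above, the sketch below treats "rate"/"rates" as a PerCapita trigger only when it is not part of a compound such as "unemployment rate". The names `_GENERIC_RATE_RE` and `looks_per_capita` are hypothetical, and the excluded compounds come only from the examples in this thread; the actual regex in the repository may differ.

```python
import re

# Hypothetical pattern (an assumption, not the repository's regex): match
# "rate" or "rates" as a standalone word unless it directly follows a term
# that makes it part of a variable name.
_GENERIC_RATE_RE = re.compile(
    r"(?<!unemployment )(?<!mortality )\brates?\b",
    re.IGNORECASE,
)


def looks_per_capita(query: str) -> bool:
    """Returns True if the query contains a generic "rate(s)" trigger."""
    return _GENERIC_RATE_RE.search(query) is not None
```

Under these assumptions, "poverty vs. unemployment rate" is not flagged as PerCapita, while a query such as "crime rate in california" is.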

@chejennifer
Contributor Author

updated changes:

  • add alternate sv description with just murder for svs about murders & non-negligent murders
  • remove "heart attack" from stroke description
  • update PerCapita classification for "rates" to use same regex as "rate"
  • only remove PerCapita stop words for toolformer mode

sv diffs for index updates: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_11_11_41_44.html

sv diffs for index updates AND removal of per capita stop words: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_11_09_43_43.html

  • base is old index and KEEP per capita stop words
  • test is new index and REMOVE per capita stop words

diffs look ok to me

@pradh
Contributor

pradh commented Jul 12, 2024

Awesome! How did you generate the 2nd diffs? For future reference.

Contributor

@pradh pradh left a comment

Thanks for the edits!

# TODO: decouple words removal from detected attributes. Today, the removal
# blanket removes anything that matches, including the various attribute/
# classification triggers and contained_in place types (and their plurals).
# This may not always be the best thing to do.
Contributor

Move this to combine_stop_words function def site?

Contributor Author

done


-# We do not want to strip words from events / superlatives / temporal
+# We do not want to strip words from events / superlatives / temporal / percapita
 # since we want those to match SVs too!
Contributor

Have a comment so we will remember why we retain PerCapita words in stop-words in main DC, perhaps an example?

Contributor Author

done
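
A minimal sketch of the behavior discussed in this review thread, assuming stop words are assembled from a base list plus the trigger words of each heuristic type, minus the types whose words should stay in the query. Every name and word list here is a hypothetical stand-in, not the contents of `shared_utils` or `shared_constants`.

```python
import re
from typing import Dict, FrozenSet, List, Set

# Illustrative data only (assumptions, not the real constants).
BASE_STOP_WORDS: Set[str] = {"the", "in", "of", "what", "is"}
HEURISTIC_TRIGGER_WORDS: Dict[str, List[str]] = {
    "PerCapita": ["per capita", "per person"],
    "Superlative": ["highest", "lowest"],
    "Comparison": ["compare", "vs"],
}

# Heuristic types whose trigger words stay in the query in each mode.
KEPT_TYPES_MAIN: FrozenSet[str] = frozenset({"Superlative"})
KEPT_TYPES_TOOLFORMER: FrozenSet[str] = frozenset({"Superlative", "PerCapita"})


def combine_stop_words_sketch(kept_types: FrozenSet[str]) -> List[str]:
    """Builds the stop-word list, skipping trigger words of kept types."""
    stop_words = set(BASE_STOP_WORDS)
    for heuristic_type, words in HEURISTIC_TRIGGER_WORDS.items():
        if heuristic_type in kept_types:
            continue
        stop_words.update(words)
    # Longest first, so multi-word phrases are removed before single words.
    return sorted(stop_words, key=len, reverse=True)


def remove_stop_words_sketch(query: str, stop_words: List[str]) -> str:
    """Removes each stop word or phrase on word boundaries."""
    text = query.lower()
    for phrase in stop_words:
        text = re.sub(rf"\b{re.escape(phrase)}\b", " ", text)
    return " ".join(text.split())
```

With these assumed lists, "per capita" is stripped from a query in the main mode but survives in toolformer mode, where it can then contribute to matching against SV descriptions.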

@chejennifer
Contributor Author

updated code in differ.py:

_STRIP_STOP_WORDS_NEW_FN = 'STRIP_STOP_WORDS_NEW'

flags.DEFINE_enum('test_query_transform', _EMPTY_FN, [
    _STRIP_STOP_WORDS_FN, _STRIP_STOP_WORDS_NO_EXCLUSION_FN, _EMPTY_FN,
    _STRIP_STOP_WORDS_NEW_FN
], 'Transform to perform on test query.')

_ALL_STOP_WORDS_NEW = shared_utils.combine_stop_words(shared_constants.HEURISTIC_TYPES_IN_VARIABLES_TOOLFORMER)

_QUERY_TRANSFORM_FUNCS: dict[str, Callable[[str], str]] = {
    _STRIP_STOP_WORDS_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS),
    _STRIP_STOP_WORDS_NO_EXCLUSION_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS, {}),
    _STRIP_STOP_WORDS_NEW_FN:
        lambda q: shared_utils.remove_stop_words(q, _ALL_STOP_WORDS_NEW)
}

command to run the diff:

./run.sh base_uae_mem base_uae_mem --base_query_transform=STRIP_STOP_WORDS --test_query_transform=STRIP_STOP_WORDS_NEW --queryset=tools/nl/svindex_differ/queryset_vars_withstopwords.csv

@chejennifer chejennifer merged commit 95e7096 into datacommonsorg:master Jul 12, 2024
9 checks passed
jm-rivera added a commit to ONEcampaign/one-datacommons-website that referenced this pull request Oct 18, 2024
* Added bard staging environment (#4359)

- Added bard staging environment
- Fixed bug in `gke/get_storage_permission.sh` where the project
environment name was being used for the robot service account instead of
the project_id. Before:
`website-robot@bard_staging.iam.gserviceaccount.com` (inconsistent with
the account created by the `create_robot_account.sh` script). After:
`[email protected]`

* [rag eval] read eval type from sheet (#4363)

* Show per capita when the query has "per person" (#4367)

* [rag eval] add table pane component (#4370)

https://github.com/datacommonsorg/website/assets/69875368/cf87a224-58c2-4a6c-ad2a-92ac1aa07e0a

* Add block evaluation claim counter (#4366)

* [rag eval] add concept of feedback stage (#4372)

- Add a new FeedbackStage enum to be used when deciding what to display
in the query section and feedback sections
- Changes to make RIG eval tool work with this new concept
- only change made for RAG was displaying the rag calls in the query
section when feedback stage is the CALLS stage. Actual navigation for
RAG will come later

* DC website compare tool navigation support (#4365)

With this, when clicking in one side of the iframe, the other iframe can
be updated correspondingly based on the url path.

* [eval] fix previous button bug (#4373)

* Updated mixer for 6/20/2024 release (#4375)

* Added updated sdg table and ilo tables (#4376)

* Fix errors in curated stat var descriptions (#4379)

sv diff report:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_20_22_01_06.html

* [rag eval] Add navigation logic and get calls feedback working (#4377)

https://github.com/datacommonsorg/website/assets/69875368/29977c28-21ad-4ebf-8bdd-7f6f8300b3ca

TODO: add loading spinner & maybe look into why the feedback section is
slow to update

* Add cacheSVFormula to autopush.yaml (#4381)

Some examples of calculations tested locally:

Timeline: 
<img width="1512" alt="Screenshot 2024-06-21 at 9 23 44 AM"
src="https://github.com/datacommonsorg/website/assets/77713883/02b0d6b8-bd7e-4d8c-bc44-e10d66d0803b">

Map: 
<img width="1512" alt="Screenshot 2024-06-21 at 9 23 51 AM"
src="https://github.com/datacommonsorg/website/assets/77713883/3b27feeb-cacb-40c1-abc9-9c6875dae133">

* [rag eval] add rag ans section (#4380)

https://github.com/datacommonsorg/website/assets/69875368/7334d32d-4042-41a8-aca0-c7a8ef75e62f

* Avoid place fallback for mode=toolformer_* (#4374)

Data Gemma use cases expect results for the exact place, so when we don't
have data, don't try to do place fallback.

**Before**

![image](https://github.com/datacommonsorg/website/assets/4375037/88d17463-fe5a-4f5e-ba2f-f713d59e1009)

**After**

![image](https://github.com/datacommonsorg/website/assets/4375037/079f2217-d5d1-4b97-9dfb-60b547b607cd)

* fixed caching issue for POST requests to /api/observations/series (#4382)

Before: POST request for a stat var returns the wrong result due to
caching
Requested: Count_Person
<img width="1428" alt="Screenshot 2024-06-21 at 3 34 31 PM"
src="https://github.com/datacommonsorg/website/assets/13766/67eb91c1-4ac7-4c7e-abfc-9c81b919678c">
Got: SDG variable
<img width="1458" alt="Screenshot 2024-06-21 at 3 34 40 PM"
src="https://github.com/datacommonsorg/website/assets/13766/1c2ea88c-de9a-45eb-96b3-ed4d11f61ff3">

After cache key correction:
<img width="1240" alt="Screenshot 2024-06-21 at 3 38 24 PM"
src="https://github.com/datacommonsorg/website/assets/13766/c5d9f1d4-6ad6-4eac-af39-4bbc903a7e57">
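
The commit message does not show the fix itself. As a hedged sketch of the general idea behind a cache key correction for POST requests, the example below derives the key from the request body as well as the path, so two POSTs to the same endpoint with different variables cannot share a cache entry. `make_cache_key` is a hypothetical helper and not the website's actual implementation.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def make_cache_key(method: str,
                   path: str,
                   body: Optional[Mapping[str, Any]] = None) -> str:
    """Builds a cache key that also depends on the JSON request body."""
    key = f"{method.upper()}:{path}"
    if body is not None:
        # Canonical JSON, so key order in the body does not change the key.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        key = f"{key}:{digest}"
    return key
```

Hashing a canonical serialization keeps the key short and stable regardless of how the client orders the fields in the body.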

* updated apigee instructions (#4368)

* Updated un staging tables (#4378)

* Added redis cache layer for staging.datacommons.org (#4383)

Note: after merging, update oncall docs with a step to clear redis after
deploying to staging

* Major re-structure of build-embeddings tool; Use multiple sv/topic with the same description now (#4371)

Major changes:

- Re-structure build embeddings script to use a few modularized
functions from tools/nl/embeddings/utils.py
- Make base and custom embeddings build scripts *almost* identical now.
Further clean up is needed to merge them.
- Use multiple sv/topic with the same description now. The diff can be
inspected from _preindex.csv files
- Updated all the embedding indexes based on the changes
- svdiff report for base:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_20_00_09_46.html

Minor changes:

- Rename/Removed several non critical embedding indexes
- Updated documentation and scripts

TODO:
- clean up all custom_dc path assumptions in nl_server,
web_server(admin) and build_embedding_tool.
- deprecate custom dc embedding build script following above.
- add back tests for tools/nl/embeddings/utils.py.

* [rag eval] some small bug fixes (#4385)

- fix some table rendering issues
- add stage indicator
- update overall options for RAG to be: irrelevant, somewhat relevant,
relevant
- update dc response options for RAG to be: does not match question,
matches question
- handle when there are no dc calls

* Fix html diff tool bug and https->http issue (#4384)

* [rag eval] add claims counters (#4388)

update claim counter page to have 2 sections w/ 3 counters each:
1. statistical claims
    - total claims
    - false claims
    - unique tables
2. inferred claims
    - total claims
    - false claims
    - unsubstantiated claims

![Screenshot 2024-06-24 at 3 07
33 PM](https://github.com/datacommonsorg/website/assets/69875368/de4bbced-8890-4f26-8e38-51e2b4942c32)

* Keep only one build_embedding tool and move all custom DC logic/path logic to admin/html.py (#4389)

Major changes:

- Put most custom DC specific constants, path, logic into admin/html.py.
As this is the "offline import manager", it makes sense to let it decide
on all these params and pass to scripts in different stages.
- Make build_embeddings.py a general tool that can be purely self
contained.
- Support catalog_dict and catalog_paths when reading catalog.
- In custom DC docker, put catalog.yaml as the same path as the GKE
deployment, to unify the paths.

* custom dc script to load data and generate embeddings (#4268)

Copy of import/simple/stats/run_stats.sh to
- process data
- generate NL embeddings

* Fork build trigger config and deployment script for custom DC autopush (#4386)

- cloudbuild.deploy.yaml -> cloudbuild.push.yaml, which adds a step to
update repo version and removes a step to build and push a custom DC
image
- deploy shell script -> build-and-deploy shell script, which pulls
latest submodules and builds/pushes a new image before deploying.

Once the new trigger flow is fully set up, the old yaml and script can
be deleted.

* [rag eval] add overall questions feedback section (#4390)

add another eval stage for users to evaluate the questions asked by the
LLM

<img width="1779" alt="Screenshot 2024-06-24 at 5 33 32 PM"
src="https://github.com/datacommonsorg/website/assets/69875368/77ed8486-4d0a-4b43-8118-78236ca008b1">

* Fixed issue when .removeAttribute called on a web component's convertArrayAttribute property (#4391)

Fixes this error in UN staging site:

<img width="1216" alt="Screenshot 2024-06-25 at 12 54 32 AM"
src="https://github.com/datacommonsorg/website/assets/13766/26e7af44-4b2f-4dc9-a384-e6d0cb404f30">

Impacts lit elements with properties with decorator:
```
@property({ type: Array<string>, converter: convertArrayAttribute })
```

When React removes a property from an element, it calls
`<domElement>.removeAttribute`. This then calls
`convertArrayAttribute(undefined)`, which was causing an exception.

TODO: add webdriver unit test for this case

* Fix diff page path input issue and cross origin error; No need to load maps api key in explore page (#4392)

* Fix nl server build issue (#4394)

* Added 'part' selectors to 'show metadata' link (#4393)

Gives web component users ability to hide and style the 'show metadata'
link

* Updated clearcache tool logging (#4387)

Before:
```
./tools/clearcache/clear_prod.sh
WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues.

To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command.
Updated property [core/project].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-us-central1.
Defaulted container "website" out of: website, nl
True
WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues.

To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command.
Updated property [core/project].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-us-west1.
Defaulted container "website" out of: website, nl
True
```

After:
```
./tools/clearcache/clear_prod.sh
WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues.

To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command.
Updated property [core/project].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-us-central1.
Defaulted container "website" out of: website, nl
Clearing cache for datcom-website-prod/website-us-central1/us-central1, redis host 10.167.58.139: True
WARNING: Your active project does not match the quota project in your local Application Default Credentials file. This might result in unexpected quota issues.

To update your Application Default Credentials quota project, use the `gcloud auth application-default set-quota-project` command.
Updated property [core/project].
Fetching cluster endpoint and auth data.
kubeconfig entry generated for website-us-west1.
Defaulted container "website" out of: website, nl
Clearing cache for datcom-website-prod/website-us-west1/us-west1, redis host 10.158.101.59: True
```

* remove redirect for missing trailing slashes on explore links (#4396)

the redirect from /explore landing pages caused a downgrade from https
to http.
also fix a few http:// links to docsite.

follow up to #4392

* [Build Embeddings] Save index config, md5 and add test to check integrity (#4348)

- Removed old preindex/duplicate files and keep all input files simply
under 'input' directory.
- Re-built all indexes to generate the index_config.yaml and md5sum.txt
in the embeddings output gcs folder.

* Use commit hashes from all of website, mixer, and import to label custom DC images (#4399)

* [eval] Fix empty feedback recorded, table parsing (#4404)

- fix bug with empty call response being saved to firestore when "apply
to next" is selected
- fix table parsing issue when header contains "-"

* [rag eval] update feedback stage order (#4405)

move overall question feedback to before individual question feedback

* Support query transforms with SV differ (#4400)

Diff for medium_ft with vs. without stop-words removal:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/shanth_medium_ft_2024_06_25_21_00_30.html

The canonical query in this case has stop-words.

* [rag eval] fix crashing question, add loading, fix answer showing instead of questions (#4407)

- fix crashing when a query doesn't have an answer
- add loading when fetching data for the feedback section
- update reading either answer or questions in the LHS

* Fix submodule treatment in custom DC build script (#4403)

- Actually initialize submodules so that checkout + pull has an effect.
- Make a temporary commit so that HEAD:import and HEAD:mixer are
updated.

* [rag eval] update wording in calls feedback, add another question to overall feedback (#4406)

- update wording in calls feedback
<img width="1766" alt="Screenshot 2024-06-26 at 3 17 43 PM"
src="https://github.com/datacommonsorg/website/assets/69875368/3cb62970-e6c5-4a78-8a56-1bf014920028">

- add another question to overall feedback
<img width="1755" alt="Screenshot 2024-06-26 at 3 17 23 PM"
src="https://github.com/datacommonsorg/website/assets/69875368/b392481e-2d26-410d-aff4-2f0aae411b6e">

* Set up git email before creating temp commit (#4408)

Without this extra line, `git commit` fails with an error about
requiring identity.

* Add a direct link to the required version of protoc (#4412)

When I was onboarding, I found the versioning for protoc confusing, and
it also took me a while to figure out where I could download the
required version. Adding a direct link for convenience for future
readers/contributors.

* update goldens (#4411)

* Updated unsdg staging tables (#4395)

* [nl] prune topic vars for rig (#4409)

Allow topics for rig but prune the svs so that we only keep the ones
that were also separately detected on their own.

Only keep pruned topics if there were no svs returned (which, I'm
slightly conflicted if we need this piece)

* Fix a bunch of stat var descriptions based on gemma eval results (#4413)

- Add "carbon footprint" to greenhouse  gas emissions description
- Remove `(Non-Biogenic)` in GHG variables to be consistent
- use "without" for No_HealthInsurance var descriptions.

svdiff:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_27_15_06_07.html

* Support Incremental embeddings build (#4401)

Load saved embedding vectors from catalog.yaml and only compute vectors for
text that has not been seen before.

Also:
- Use a SentenceObject to hold dcid, text and vector
- Added a bunch of tests
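
The incremental idea above can be sketched as follows. This is an illustration under assumptions: the field layout of `SentenceObject` is guessed from the description (dcid, text, vector), and `fill_vectors` and the `encode` callable are hypothetical names rather than functions from the build tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


@dataclass
class SentenceObject:
    """Holds one description sentence; the field layout is an assumption."""
    dcid: str
    text: str
    vector: Optional[Vector] = None


def fill_vectors(sentences: Sequence[SentenceObject],
                 saved_vectors: Dict[str, Vector],
                 encode: Callable[[List[str]], List[Vector]]) -> int:
    """Reuses saved vectors and encodes only unseen texts.

    Returns the number of texts that had to be newly encoded.
    """
    unseen = sorted({s.text for s in sentences if s.text not in saved_vectors})
    vectors = dict(saved_vectors)  # leave the caller's mapping untouched
    if unseen:
        vectors.update(zip(unseen, encode(unseen)))
    for sentence in sentences:
        sentence.vector = vectors[sentence.text]
    return len(unseen)
```

Deduplicating the unseen texts before encoding means a description shared by several dcids is only sent to the model once.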

* Make a shared nl requirements.txt that can be imported from nl server and nl tools (#4410)

* Update about.html (#4369)

Change "I" to "we"

* [nodejs] update csv table header for toolformer_rag mode, add website commit in debug info (#4414)

- update csv table headers for mode=toolformer_rag:
- line chart: "label" -> "date" e.g.,
https://paste.googleplex.com/4545373437952000
- bar chart: "label" -> "variable" e.g.,
https://paste.googleplex.com/6171709243916288
  - map_chart: "label" -> "place"

- read website commit from environment and return it as part of
debug_info

* [eval] handle empty text processing (#4418)

* Update RAG eval tool counter layout for easier interaction (#4420)

![image](https://github.com/datacommonsorg/website/assets/5951856/b69aab9d-7c6e-4660-90de-c68f4033b620)

* [rag eval] make less sheets reads, fix more table parsing (#4422)

- make one call to read multiple rows instead of making one call per row
- fix table parsing when header contains "," or "₂"

* [nl] don't use default place for toolformer modes (#4419)

* Update SDG topics (#4421)

* Update NL embeddings eval playground (#4424)

Remove golden stat vars and eval scores since these are not practical
and reliable.

Updated this into a playground that can upload sv descriptions and
queries on the fly and check stat var matches. The matches can highlight
stat vars that have override descriptions.


![image](https://github.com/datacommonsorg/website/assets/5951856/4c31af16-cfb0-4e11-ba4a-7e4dcee74dde)

* Add e2e instructions of adding stat var descriptions (#4425)

* Remove "total" from "total population" and some other stat var descriptions. (#4423)

Consolidated and cleaned up some other stat vars.
Also updated svdiff tool README.


https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_06_28_20_09_29.html

* minor fixes to nl scripts (#4426)

* Update diffs after staging push (#4427)

* Update drive to work stat var descriptions (#4431)

Also fixed some doc issues.

svdiff report:


https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_17_51_44.html

* [Stop Words] Introduce stop words exception list and add "how many", "number of" to this list  (#4416)

The larger embeddings model can understand semantics better (instead of
token match). So we should try to preserve stop words. Including stop
words would be a big change and we should do this in a controlled way.

This introduces an exception list of phrases that contain stop words and
should be kept.

With this, the query sentence will be "number of asian" instead of
"number asian". Turns out this boosts the matching and the score a lot!


https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_01_13_41_51.html
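
One way to picture the exception list is the sketch below, which protects exception phrases with placeholders before stripping stop words and restores them afterwards. The function name, the placeholder scheme, and the word lists are assumptions made for illustration; the repository's implementation may work differently.

```python
import re
from typing import Dict, Iterable


def strip_stop_words(query: str,
                     stop_words: Iterable[str],
                     exceptions: Iterable[str] = ()) -> str:
    """Removes stop words but keeps any phrase listed in `exceptions`."""
    text = query.lower()
    protected: Dict[str, str] = {}
    # Swap each exception phrase for a placeholder that no stop word matches.
    for i, phrase in enumerate(sorted(exceptions, key=len, reverse=True)):
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, text):
            token = f"__keep{i}__"
            text = re.sub(pattern, token, text)
            protected[token] = phrase
    # Remove stop words, longest first, on word boundaries.
    for word in sorted(stop_words, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(word)}\b", " ", text)
    # Put the protected phrases back.
    for token, phrase in protected.items():
        text = text.replace(token, phrase)
    return " ".join(text.split())
```

With "of" as a stop word and "number of" in the exception list, the query "number of asian" is left intact instead of being reduced to "number asian".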

* [nodejs] add highlight tile handling (#4417)

allow nodejs to return a tile result for highlight tiles when the mode
is set to something that is not bard

e.g., what is the percentage of urban population in Tamil Nadu?

no mode set: https://paste.googleplex.com/5341628766355456
mode=toolformer_rig: https://paste.googleplex.com/5803309044858880

* [eval tool] fixes for feedback pane formatting, wrong answer showing up, some easy nits (#4428)

* Add ILO topics (#4430)

* Remove old versions of forked deploy script/config (#4429)

- deploy_custom_dc_autopush.sh is superseded by
build_and_deploy_custom_dc_autopush.sh
- deploy.yaml is superseded by push.yaml

* Support sv description yaml file; Convert base DC sdg topics csv into yaml (#4435)

* [embedding playground] Support index selection; Improve performance (#4437)

![image](https://github.com/datacommonsorg/website/assets/5951856/6a49246a-225b-44ab-ba6f-f3b5c1149638)


- Make sure each model box renders only once. The encode and match time
is substantial for larger model.
- Add checkbox for index selection
- Keep only one "apply" button

* skip a broken webdriver test (#4439)

skipping broken webdriver test because the failure looks like it is caused by
something upstream in mixer/data

* [eval] set up components to be used by sxs evals (#4436)

update components that will be shared with sxs evals to take props
instead of using context

* Test custom DC autopush homepage load after deploying (#4346)

* Increased fulfill stat var results limit for undata nl indexes (#4441)

- Increased stat vars returned from 50 to 200
- Updated check to allow for increased limit on the SDG index

* Updated ILO embeddings (#4438)

Updated ILO embeddings to reflect stat var grouping changes from #4430.
Test in explore page with URL params `dc=undata_dev` or `dc=undata_ilo`.

Example:
http://localhost:8080/explore#q=unemployment+in+the+usa&client=ui_query&dc=undata_dev

<img width="1397" alt="Screenshot 2024-07-03 at 4 13 38 AM"
src="https://github.com/datacommonsorg/website/assets/13766/32228e74-720b-4cc0-8bed-04e3674032f7">

* Updated un staging tables (#4442)

- Removed sdg table since it's now covered in
`schema_2024_07_03_12_17_40` and `country_2024_07_03_10_18_47`
- Updated ilo tables

* Update 'prevalence' to 'proportion of' in stat var description (#4434)

"prevalence" is a very strong word match and results in non-optimal
matching and ranking.

"proportion of" seems to be more general wording and works well for a
range of query keywords (see diff below).

I have also tried "percentage", which doesn't really match prevalence queries
very well (maybe prevalence could refer to millionths and is not commonly
used with percentage?)

sv diff report:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_03_18_26_13.html

The two examples below show increased score of matching when asking in
different ways.


![image](https://github.com/datacommonsorg/website/assets/5951856/3b181ca4-4caf-4826-940c-cacde9d10f32)

![image](https://github.com/datacommonsorg/website/assets/5951856/c9f67dcd-a948-4730-b77f-ea2887611ade)

* Add back dc/topic/Asthma now the child sv order is fixed (#4445)

svdiff report:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/boxu_base_uae_mem_2024_07_04_11_32_19.html

* [gemma sxs] create new app (#4440)

- add plumbing for new sxs eval app at:
/eval/retrieval_generation_sxs?sheetIdLeft=&sheetIdRight=&queryId=
- re-use eval components to display the LHS and RHS

<img width="1765" alt="Screenshot 2024-07-03 at 2 10 32 PM"
src="https://github.com/datacommonsorg/website/assets/69875368/90369c8f-670b-450c-8009-959f6b52d58d">

* Allow duplicate dcid among input csv and yaml files (#4443)

This is to allow alternative files.

* enabled mixer rest api for unsdg project (#4446)

This enables /v1 and /v2 endpoints like:
https://unsdg.datacommons.org/v2/node?nodes=dc/g/UN&property=-%3E* which
are used by the UN data site

* [Eval UI] Trigger sign in on load (#4447)

Also show some explanatory text while things are loading.

* [toolformer] replace "residents" with "people" (#4449)

example: https://screenshot.googleplex.com/C3QjzRnnzD9euVc

sv index differ:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_08_12_11_23.html
where test replaces residents with people

* Fix build embeddings tool for local paths. (#4450)

* [nodejs] add highlight field to highlight tile response (#4448)

e.g., what is the percentage of urban population in Tamil Nadu?
https://paste.googleplex.com/4946781240819712

* Fixed css bug that was hiding the stanford custom dc navbar (#4451)

## Before:
![Screenshot 2024-07-08 at 3 30
32 PM](https://github.com/datacommonsorg/website/assets/13766/c52d87de-ca2a-4230-b55b-20020eb17520)

## After:
![Screenshot 2024-07-08 at 3 31
43 PM](https://github.com/datacommonsorg/website/assets/13766/591f906c-d935-4d6c-b87a-5c55f357e8b5)

* Set env variables to fix sentence transformer issues in cloud run. (#4460)

* Added datagemma gke configuration (#4454)

* remove negation from stop words (#4455)

svindex differ:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_20_03_18.html

base has all the stop words
test has the negation stop words removed

* Make base/sheets_svs.csv a proper csv (#4464)

So it can be viewed as a [table on
github](https://github.com/datacommonsorg/website/assets/4375037/6f355c58-89da-4708-99dc-eb1061ef771c)
and loaded into BQ, etc.

* [SxS Eval] Update how left/right are determined and add baseline eval type  (#4458)

Also:
- Make all inputs to left/right picker available via context
- Clean up some unused stuff
- Factor out duplicated template section into its own component

* Skip experimental import group for non-autopush (#4466)

* sv index updates (#4457)

- remove temperature, min/max temperature, houseless X
scheduledCaste/scheduledTribe variables
- add descriptions: include "children" for
Count_Person_18OrLessYears_NoHealthInsurance, include "native american"
for Count_Person_AmericanIndianOrAlaskaNativeAlone
- replace Annual_Emissions_CarbonDioxide_Biogenic with
Annual_Emissions_CarbonDioxide_NonBiogenic

svindex diffs:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_10_10_15_51.html

* Move custom DC tests out of server/webdriver/tests (#4465)

Otherwise they get run as part of ./run_test.sh -w

* remove per capita stop words (#4456)

same change as https://github.com/datacommonsorg/website/pull/4415 but
rebased off a clean master

svindex diff:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_09_21_55_14.html

base does not remove per capita stop words
test removes per capita stop words

* Add support for testing bad-words file (#4462)

Instructions in cl/651193865

* NL: Drop custom cleanup logic for SV titles (#4471)

Fork of an old version of
https://github.com/datacommonsorg/website/pull/4461/files.

* [Eval SxS] Feedback footer with navigation; Firestore writes (#4468)

- Add a feedback component and style it as a footer.
- Add previous/next buttons that navigate between queries.
- Write to Firestore when navigating previous/next. Just write static values for now.

Also fix randomization of left/right.

* [toolformer] replace "global" with "world" in queries in toolformer mode (#4475)

ran goldens without the `if (params.is_toolformer_mode(dargs.mode))` &
there were no diffs

* [Eval SxS] Actually get and set ratings; Improve layout (#4472)

- Add radio buttons for preference and a text area for reason
- Fetch and show existing ratings, correcting left/right orientation if
necessary
- Save rating when navigating if it has been updated
- Make panes scroll independently
- Display query ID and question only once
- Make footer take up only as much room as it should


[Screencast](https://github.com/user-attachments/assets/1f7cb506-eacf-408a-8219-84b9be59165c)

* add women in parliament to nl index (#4476)

no diffs:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_12_16_30_34.html

* Added GKE setup script to grant Vertex AI User role to robot/service account (#4479)

* Avoid a buggy regex for comparison heuristic (#4477)

This was added in the very early days, in
https://github.com/datacommonsorg/website/pull/2155, for comparison
between two variables.

In practice, "compar.." or "vs" are the trigger words we use. There are
too many potential unknown matches with "...er$". Even comparative words
like "better" or "greater" do not necessarily mean comparison across
vars ("which california county has a greater chance of ...?"). So let's
just drop it!

There are no diffs in integration-tests. TODO: I hope to check the
screenshot diffs after submitting this PR...
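
A minimal sketch of why a suffix match is risky. Both patterns below are hypothetical stand-ins written for illustration (the dropped regex itself is not quoted in this description), and the sample queries are made up:

```python
import re

# Hypothetical stand-in for the dropped heuristic: any word ending in "er".
ER_SUFFIX = re.compile(r"\b\w+er\b", re.IGNORECASE)
# Explicit trigger words of the kind described above ("compar.." or "vs").
TRIGGERS = re.compile(r"\b(?:compar\w*|vs)\b", re.IGNORECASE)


def matches_er_suffix(query: str) -> bool:
    """True if any word in the query ends in 'er'."""
    return ER_SUFFIX.search(query) is not None


def is_comparison(query: str) -> bool:
    """True only if an explicit comparison trigger word is present."""
    return TRIGGERS.search(query) is not None
```

With these patterns, "number of voters in ohio" trips the suffix match (via "number") even though it is not a comparison, while "poverty vs unemployment in texas" is caught by the explicit trigger alone.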


![image](https://github.com/user-attachments/assets/f46a6e8c-74fb-46b9-a432-7e73ebd54d75)

* update goldens after topic cache update (#4478)

commute time topic updated with mean commute time replacing total
commute time

* Add CDC SV explorer sanity test. (#4481)

* [gemma eval tools] filter out empty tables in table pane (#4484)

* [toolformer] promote svs from a topic for rag (#4482)

If an sv belongs to a topic but shows up immediately after that topic in
the results, promote it to just before the topic

* Update submods. (#4489)

* Load custom dc embeddings at server startup. (#4473)

* Allow subset of pattern exclusions from heuristics (#4487)

Currently, words like "youngest" etc are added to stop-words. Add a set
of exception patterns per heuristic type. Start off with
toolformer-only.

* [toolformer] exclude correlation fulfiller (#4486)

* Add Custom DC NL sanity test. (#4485)

* Show DC stats in tooltips for RIG (#4483)

For RIG answers:
- Don't show inline DC stats. Instead show DC stat with label in a
tooltip when an LLM stat is hovered over.
- When an LLM stat in SxS UI doesn't have an associated DC stat, don't
highlight it or show a tooltip. (For regular eval UI, still highlight
but don't add a tooltip.)
 
Other changes:
- Change "Why?" free-text form field label to "Comments (optional)"
- Show "Loading answer..." when fetching answers, so answers are never
out of sync with the feedback form.
- Fetch all call data when initially loading each spreadsheet
- Don't limit some sheet calls by column (fetch more data including
possibly extraneous data in exchange for making fewer fetches)
- Minor naming updates for readability


[Screencast](https://github.com/user-attachments/assets/a7ca0359-ee0c-40ff-99eb-b332cae61bff)

* avoid stripping period for "St." (#4490)

Some place names contain "St." (for example, St. Landry Parish), and
stripping the period from them prevents the place from being detected.
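
A rough sketch of the idea, assuming query cleanup strips a period at the end of a word. The function name, the regex, and the `KEEP_PERIOD` list are hypothetical and not taken from the actual change:

```python
import re

# Hypothetical list of abbreviations whose trailing period should be kept.
KEEP_PERIOD = ("st",)


def strip_periods(query: str) -> str:
    """Drop a period that ends a word, unless the word is a kept abbreviation."""

    def repl(match: re.Match) -> str:
        word = match.group(1)
        if word.lower() in KEEP_PERIOD:
            return match.group(0)  # leave "St." untouched
        return word  # drop the period

    # Only periods followed by whitespace or end of string are considered.
    return re.sub(r"\b(\w+)\.(?=\s|$)", repl, query)
```

For example, `strip_periods("population of St. Landry Parish.")` keeps "St." intact and only drops the final period.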

* Add caching to place pages (#4480)

Adds caching to the place pages in preparation for SEO experimentation.

Because the place pages don't get updated often, I did not set a
timeout. The cache will refresh whenever we clear the cache as part of a
website release.

* [timeline] fix ratio bug (#4493)

There was a bug in picking the denominator observation: when checking
whether the next denominator was a better match, the code compared it
with the earliest denominator and not the previous denominator. This
caused weird behavior for per capita timelines.
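
To illustrate the kind of bug described, here is a simplified sketch. It assumes dates are plain years and that the goal is to pick the denominator year nearest to a numerator year; the function names are hypothetical and the real timeline code may differ:

```python
def closest_denominator(num_year, denom_years):
    """Return the denominator year nearest to num_year.

    denom_years must be non-empty and sorted ascending.
    """
    best = denom_years[0]
    for candidate in denom_years[1:]:
        # Compare against the previous best pick, not denom_years[0].
        if abs(candidate - num_year) < abs(best - num_year):
            best = candidate
        else:
            break  # sorted input: later years only get farther away
    return best


def closest_denominator_buggy(num_year, denom_years):
    """Same walk, but always comparing against the earliest denominator."""
    best = denom_years[0]
    for candidate in denom_years[1:]:
        if abs(candidate - num_year) < abs(denom_years[0] - num_year):
            best = candidate
    return best
```

With denominators for 2000, 2010 and 2020 and a numerator year of 2011, the buggy variant drifts to 2020 because every later year still beats the earliest one, while the fixed variant stops at 2010.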

autopush: https://screenshot.googleplex.com/6gh9Nyjr7kg638C
local: https://screenshot.googleplex.com/BHUsQPMih5X27Hy

* Added API handler for fetching observation dates given a list of entities and variables (#4474)

Added API handler for fetching observation dates given a list of
entities and variables.

New website API endpoint: `/api/observation-dates/entities`. This
endpoint is similar to the `/api/observation-dates` endpoint, but it
takes a list of entities instead of a parentEntity and childType.

This endpoint will support updating the timeline slider web component to
accept a list of entities rather than parent/child relationships.

Example usage:
  ```
GET
/api/observation-dates/entities?entities=country/USA&entities=country/CAN&variables=Count_Person&variables=Count_Household
  Response:
  {
      "datesByVariable": [
          {
              "variable": "Count_Person",
              "observationDates": [
                  {
                      "date": "1900",
                      "entityCount": [
                          {
                              "facet": "2176550201",
                              "count": 1
                          }
                      ]
                  },
                  {
                      "date": "1901",
                      "entityCount": [
                          {
                              "facet": "2176550201",
                              "count": 1
                          }
                      ]
                  }
              ]
          },
          {
              "variable": "Count_Household",
              "observationDates": [
                  {
                      "date": "1900",
                      "entityCount": [
                          {
                              "facet": "2176550202",
                              "count": 1
                          }
                      ]
                  },
                  {
                      "date": "1901",
                      "entityCount": [
                          {
                              "facet": "2176550202",
                              "count": 1
                          }
                      ]
                  }
              ]
          }
      ],
      "facets": {
          "2176550201": {"importName": "facet1"},
          "2176550202": {"importName": "facet2"}
      }
  }
  ```
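
A small client-side sketch for this endpoint, using only the path, parameter names, and response shape from the example above. The helper names and the `api_root` parameter are hypothetical:

```python
from urllib.parse import urlencode


def observation_dates_url(entities, variables, api_root=""):
    """Build the request URL, repeating `entities` and `variables` once per value."""
    query = urlencode(
        {"entities": entities, "variables": variables},
        doseq=True,  # emit one key=value pair per list element
        safe="/",  # keep DCIDs such as country/USA readable
    )
    return f"{api_root}/api/observation-dates/entities?{query}"


def dates_for_variable(response, variable):
    """Collect the observation dates listed for one variable in a parsed response."""
    for entry in response.get("datesByVariable", []):
        if entry["variable"] == variable:
            return [obs["date"] for obs in entry["observationDates"]]
    return []
```

Passing lists keeps the call site close to the new API's intent: a flat list of entities rather than a parent/child pair.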

* Add frozendict to nl_server (#4494)

https://github.com/datacommonsorg/website/pull/4487 introduced a
frozendict import in the shared lib. The NL server requirements were
missing it (the [web server requirements
include it](https://github.com/datacommonsorg/website/blob/e868bcc8a374fb72106d52cf7fd55a9d8a58f7bb/server/requirements.txt#L9)).

* Update README.md

* [Eval UI] Handle invalid query IDs; clean up after async useEffects (#4495)

* [Eval Sxs] Add an eval list; jump to first incomplete eval on load (#4491)

Also keep answers in loading state until answer text is processed.


[Demo](https://github.com/user-attachments/assets/ac31ea02-8646-4e32-8a59-12cc85178627)

* update nl index for electricity generation & age ranges (#4492)

- update descriptions with "electricity generated" -> "electricity
generation" to be consistent with descriptions about "energy generation"
- add child, adult, and senior age ranges to the index 

sv index diff:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_17_14_22_29.html

* update compose for sentence transformer

* [evals] use value instead of stringValue when getting cell content (#4498)

Cell values of type number weren't fetched when using
`getCell().stringValue`

* [SxS] only show tables that are used in answer (#4497)

for RAG answers, the table pane should only show tables actually used in
the answer

autopush: https://screenshot.googleplex.com/J5CqEUu87ptoKCd
local: https://screenshot.googleplex.com/8ygR9UTWk7HrCu6

* [CDC Autopush] Reorganize build, deploy, and test scripts (#4499)

For ease of debugging workflow failures, I want to try having each
script be its own build step. The new build config YAML will be:

```
steps:
  - id: clone-website-repo
    name: gcr.io/cloud-builders/git
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        mkdir src
        git clone https://github.com/datacommonsorg/website.git src
        cd src
    waitFor: ['-']

  - id: build-and-tag-latest
    name: gcr.io/cloud-builders/docker
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        ./scripts/build_custom_dc_and_tag_latest.sh
    waitFor: ['clone-website-repo']

  - id: deploy-latest-to-autopush
    name: gcr.io/cloud-builders/gcloud
    entrypoint: bash
    args:
      - -c
      - |
        set -e
        ./scripts/deploy_custom_dc_latest_to_autopush.sh
    waitFor: ['build-and-tag-latest']

  - id: run-tests
    name: python:3.11.3
    entrypoint: bash
    args:
      - -c
      - |
        ./run_test.sh --setup_python
        ./scripts/run_cdc_tests.sh
    waitFor: ['deploy-latest-to-autopush']

options:
  machineType: 'E2_HIGHCPU_32'

```

* [evals] fix format of values read from sheet (#4501)

use formattedValue instead of value to get the actual value that a user
sees in the sheet

<img width="1757" alt="Screenshot 2024-07-18 at 1 08 36 PM"
src="https://github.com/user-attachments/assets/81b83cc2-bfc6-45d8-a836-65a3b8959852">

* [CDC Autopush] Make new scripts executable (#4502)

I always forget this!

* [tiles] fix fraction digits in highlight tile (#4503)

The bug was that all the highlight tile values were showing 1 digit after
the decimal. This was because:

- we always defaulted numFractionDigits to 1
- before PR https://github.com/datacommonsorg/website/pull/4417 we
weren't setting the numFractionDigits field correctly in the
highlight tile. That PR made us set numFractionDigits correctly, but
for default cases that meant setting it to 1:
https://github.com/datacommonsorg/website/pull/4417/files#diff-a7d41f3b232c03e93d2230c267b7465828c783406c2be4a4998e0b9e68df1d31R212

To fix this,
- default numFractionDigits to undefined. This is safe in the other
tiles because highlight tile is the only tile that uses
numFractionDigits

screenshot diff that found this bug:
https://screenshot.googleplex.com/4yiW2b8d6WGPZ9f
localhost: https://screenshot.googleplex.com/9cYvM2Xdx4ydo6C

* Fixed csv download and show metadata link in local and custom DC to use local API path (#4463)

Before:
<img width="1407" alt="Screenshot 2024-07-10 at 5 17 34 PM"
src="https://github.com/datacommonsorg/website/assets/13766/254eadbf-8ab5-463f-83e1-d01844be8eab">

<img width="1405" alt="Screenshot 2024-07-10 at 5 17 24 PM"
src="https://github.com/datacommonsorg/website/assets/13766/7a596423-5edd-4bee-aa70-771d6c02d0c7">

After:
<img width="1413" alt="Screenshot 2024-07-10 at 5 16 27 PM"
src="https://github.com/datacommonsorg/website/assets/13766/b5b5943f-5a94-4dfc-9498-0676b36970a8">
<img width="1414" alt="Screenshot 2024-07-10 at 5 16 38 PM"
src="https://github.com/datacommonsorg/website/assets/13766/dfa93d41-6b00-4782-97c2-ce815f7b883f">

* [nl] Fix PC stop words regex (#4505)

fix the regex used for detecting rate/rates

- dropped the word boundary because both classification detection and
stop word removal (both cases where the stop words are used) add their
own word boundary & having the word boundary in the stop word regex
causes problems
- use negative lookbehind instead of negative lookahead because we
should be looking for all cases of rate that are not preceded by special
words like literacy or mortality
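
A minimal sketch of the lookbehind approach. The exclusion list here is hypothetical (only "literacy" and "mortality" are named above), and `remove_rate` stands in for the stop word removal call site, which supplies the word boundaries itself:

```python
import re

# Hypothetical exclusion words; the real list lives in the NL stop word config.
EXCLUDED = ("literacy", "mortality")

# Python's re module requires fixed-width lookbehinds, so chain one per word.
# No word boundary here: the caller adds its own.
RATE = "".join(f"(?<!{word} )" for word in EXCLUDED) + r"rates?"


def remove_rate(query: str) -> str:
    """Remove 'rate'/'rates' unless preceded by an excluded word."""
    cleaned = re.sub(rf"\b{RATE}\b", "", query)
    return " ".join(cleaned.split())  # collapse leftover whitespace
```

A lookahead placed before "rate" inspects the text that follows the match position, so it cannot express "not preceded by literacy"; the lookbehind checks the text immediately before "rate", which is the condition described above.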

* Add ESLint warning for missing return type. (#4496)

Codacy flags this but I would like to find out about it before waiting
for checks to run.

* [Eval UI] Reformat RIG tooltip; Close list modal on esc or external click (#4507)

[Demo](https://github.com/user-attachments/assets/0623d7af-3415-427e-bcc9-02410d15fbed)

* update goldens after mixer staging push (#4506)

* update submodules for release (#4510)

* [CDC Autopush] Fix submod pull no-op case (#4511)

Allow an empty commit when pulling in latest submods in case they're
already up-to-date. This should fix current autopush build failures.

* Consolidate CDC env file. (#4509)

* Make OUTPUT_DIR required. (#4512)

* Headless drivers are now blocked (#4504)

* Create initial data docker image. (#4515)

* Update bad words python test (#4516)

This PR removes the assertion that in the bad words test, in `multi`
mode, lines should not contain spaces. This assertion breaks when
creating cross products that include place names like "united states of
america" or "united kingdom".

This PR also updates the "headless drivers" NL test case to look for the
correct detection type.

This should unblock the python and nl test failures seen in unrelated,
already approved PRs.

* Update nodejs query test goldens (#4514)

Updates the goldens for our nodejs_query_differ test. Looks like there
have been some data updates since we last updated these goldens.

* [SEO Experiment] Add plumbing to read experiment pages from GCS (#4500)

To run our SEO experiments, this PR:

* Adds logic to the place pages for the server to read a jinja template
from GCS.
* Starts a folder in server config to hold experiment template files.
* Checks in a sample template for Egypt with matching CSS.
* Adds a script to sync GCS bucket contents with local config folder.

Note: Templates for the other places in the experiment group will come
in a follow up PR. Until those pages are completed, reading from GCS is
only enabled in autopush, dev, and locally (i.e. not staging and not
prod).

![Screenshot 2024-07-18 at 12 32
54 PM](https://github.com/user-attachments/assets/8a637f23-6665-4119-b2a4-bb1a2c8f8fb3)

* Remove *_env.list files. (#4517)

* Add absolute path comments in env.list. (#4518)

* Update .dockerignore to fix docker build (#4519)

Updates .dockerignore to exclude the sanity.py file (in **/tests) to fix
a cron-testing docker build failure.

* Added support for HIGHEST_COVERAGE to /api/observation/point endpoints when specifying a list of entities and variables (#4513)

Adds support for `date=HIGHEST_COVERAGE` to the
`/api/observations/point` and `/api/observations/point/all` endpoints.
Previously, `date=HIGHEST_COVERAGE` was only supported on
`/api/observations/point/within` and
`/api/observations/point/within/all`

Requesting `date=HIGHEST_COVERAGE` uses a heuristic to fetch a recent
data point (within 5 years or in the 5 most recent points, whichever is
greater). When multiple variables are specified, we apply the same
heuristic using the total observation count across all variables for a
given date to find the highest coverage.

Example usage:

Multi-entity, single variable

`/api/observations/point?entities=country/RUS&entities=country/USA&entities=country/MEX&variables=Count_Person_InLaborForce&date=HIGHEST_COVERAGE`

Multi-entity, multi-variable:

`/api/observations/point?entities=country/RUS&entities=country/USA&entities=country/MEX&variables=Count_Person_InLaborForce&variables=sdg/SI_POV_DAY1&date=HIGHEST_COVERAGE`
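
One possible reading of the heuristic, as a sketch only. It assumes dates are 4-digit year strings, that "within 5 years" is measured from the latest available date, and that ties go to the later date; none of these details are confirmed by the description, and the function names are hypothetical:

```python
from collections import Counter


def total_counts(counts_by_variable):
    """Sum per-date observation counts across all variables."""
    totals = Counter()
    for per_date in counts_by_variable.values():
        totals.update(per_date)
    return dict(totals)


def highest_coverage_date(counts_by_date, window_years=5, min_points=5):
    """Pick the recent date with the highest total observation count."""
    if not counts_by_date:
        return None
    dates = sorted(counts_by_date)  # 4-digit year strings sort chronologically
    latest_year = int(dates[-1])
    within_window = [d for d in dates if int(d) > latest_year - window_years]
    recent_points = dates[-min_points:]
    # "Whichever is greater": use the larger of the two candidate sets.
    if len(within_window) >= len(recent_points):
        candidates = within_window
    else:
        candidates = recent_points
    # Highest count wins; ties go to the later date (an assumption).
    return max(candidates, key=lambda d: (counts_by_date[d], d))
```

For the multi-variable case, `highest_coverage_date(total_counts(...))` applies the same selection to the summed counts.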

* update copd description in nl index (#4521)

sv diffs (no diffs found):
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_25_16_22_31.html

* Update mixer submodule (#4522)

Updates submodules as part of website release process. Because last
release was not so long ago, only mixer needed to be updated.

* [CDC Autopush] Update scripts to accommodate data docker (#4520)

- Split out submodule update into its own script so it can be run as a
prerequisite step for both data and service docker build steps.
  - Make sure to get website hash before making a temp commit.
- Write image label made of combined commit hashes to a temp file so
other scripts can use it.
- Rename service docker build and deploy scripts to distinguish them
from data docker build and (eventually) deploy scripts.
- Make some minor edits recommended by Shellcheck and shell-format
VSCode extensions.

This PR will temporarily break custom DC autopush until I update the
compose autopush cloudbuild config in the deployment repo. Planned
update: https://paste.googleplex.com/6181631964741632

* Update NL integration test goldens (#4523)

Updates our NL integration test goldens via `./run_test.sh -g`.

Per our release docs, our NL tests rely on staging mixer, so the goldens
need to be updated after a mixer release to staging. These updated
goldens reflect the data changes from this mixer commit:
https://github.com/datacommonsorg/mixer/commit/75c03483a4e843bb77ad71c3e0a6694a2ff39dd0

* sv index updates to handle global mortality/death rates (#4524)

- remove global from topic descriptions
- mortality rate/death rate -> mortalities/deaths
- remove "mortality rate" and "death rate" from Per Capita exclusions
because the exclusion list is all forms of "x rate" that show up as sv
descriptions in the index

sv diffs (looks ok to me):
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_26_13_31_00.html

* Fix missing overview tile on explore page (#4526)

Adds back the Google Maps API to the explore page. Its previous removal
was causing the overview tile in queries like "Tell me about [place]"
not to render properly.

Before:
<img width="1336" alt="Screenshot 2024-07-29 at 11 36 36 AM"
src="https://github.com/user-attachments/assets/6e86023d-87c4-4964-b7cf-fe9ffb3e97c0">



After:
<img width="1332" alt="Screenshot 2024-07-29 at 11 36 19 AM"
src="https://github.com/user-attachments/assets/9354ffa3-5a23-4c8d-b0f3-32f26b22f5bf">

* Update GlobalHealth topic description (#4529)

- PR https://github.com/datacommonsorg/website/pull/4524 removed
"Global" from all topic descriptions. However, this caused losses for
queries like "Health+conditions+vs+median+age+in+Alameda+County" and
"Most+common+medical+conditions+in+US" because GlobalHealth as a topic
got ranked higher than Health and HealthConditions
- here we revert that global change and instead replace the word "Global"
with "World". "World" cannot be overindexed because it would be removed
from the query (it is a place), and we confirmed that "global" in the
query does not prefer "world" from the SV description

sv diffs:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_07_29_15_21_46.html

* Add "tell me about california" to screenshot tests (#4528)

Adds https://datacommons.org/explore#q=tell%20me%20about%20california to
the screenshot tests to catch any future regressions in the overview
tile on the NL results page.

This PR is a followup to the fix made in #4526

* update bard goldens (#4530)

Update diffs after pushing most recent changes

* [Eval UI] Tweak table parsing (#4531)

- Require table divider to be at least three dashes long. This prevents
sequences of 1 or 2 dashes in values from getting parsed as dividers.
- Allow any character other than a pipe in a table header value.
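
A sketch of the two rules as regexes. The patterns and function names below are illustrative assumptions, not the ones in the Eval UI code:

```python
import re

# A divider cell is three or more dashes, with optional surrounding spaces.
DIVIDER_CELL = re.compile(r"\s*-{3,}\s*")
# A header row is a run of cells, each holding any characters except a pipe.
HEADER_ROW = re.compile(r"\|(?:[^|]+\|)+")


def is_divider_row(line: str) -> bool:
    """True if every cell between the outer pipes is a 3+ dash divider."""
    cells = line.strip().strip("|").split("|")
    return all(DIVIDER_CELL.fullmatch(cell) for cell in cells)


def is_header_row(line: str) -> bool:
    """True if the line is pipe-delimited and no cell is empty."""
    return HEADER_ROW.fullmatch(line.strip()) is not None
```

Under these rules a value row such as `| -- | 1-2 |` is no longer mistaken for a divider, and a header like `| Country | GDP ($) |` is accepted despite the punctuation.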

* Fix map tool stat var search with many places (#4534)

Fixes b/356689537 where variable search in the new map tool when
plotting countries on earth resulted in an error "Request line is too
large".

Since the error only reproduces when running the website server with
Gunicorn, also added a mode to run_server.sh for ease of reproing.

I've moved all request params to the request body, but if we want to
only move places (maybe for the sake of analytics), I can modify this.

* Updated nodejs goldens (#4532)

Reference:
https://github.com/datacommonsorg/website/tree/master/tools/nl/nodejs_query_differ#update-goldens

* Fixed error when fetching HIGHEST_COVERAGE date for variables with no observation-dates (#4533)

Fixes error in prod when fetching HIGHEST_COVERAGE point observations
for variables with no observation-dates:
https://datacommons.org/api/observations/point/within?parentEntity=country/HKG&variables=Count_Person&childType=AdministrativeArea1&date=HIGHEST_COVERAGE

No observation-dates summary for parent=country/HKG,
childType=AdministrativeArea1, variable=Count_Person:

https://datacommons.org/api/observation-dates?parentEntity=country/HKG&childType=State&variable=Count_Person
<img width="1035" alt="Screenshot 2024-07-31 at 11 45 27 PM"
src="https://github.com/user-attachments/assets/dad2fe0c-efc7-4da5-a0a7-c93c0e15368f">


Error message when running locally:
<img width="1335" alt="Screenshot 2024-07-31 at 11 41 35 PM"
src="https://github.com/user-attachments/assets/94d85164-e181-46df-ada3-066d4f5a2fe9">

After fix:
<img width="1265" alt="Screenshot 2024-07-31 at 11 43 22 PM"
src="https://github.com/user-attachments/assets/2e929522-2ee6-4212-92cc-136e8b05df0f">

* Added datacommons-bar 'subscribe' event listener to handle date change events from datacommons-slider.  (#4525)

- Added datacommons-bar 'subscribe' event listener to handle date change
events from datacommons-slider.
- Updated error display for all components.

## Bar chart slider integration

Example usage:
```
  <datacommons-bar
    apiRoot="http://localhost:8080"
    places="geoId/06 geoId/11 geoId/12"
    date="HIGHEST_COVERAGE"
    title="Life expectancy vs Median age in California, the District of Columbia, and Florida (${date})"
    subscribe="dc-bar"
    variables="LifeExpectancy_Person Median_Age_Person"
  >
    <div slot="footer">
      <datacommons-slider
        apiRoot="http://localhost:8080"
        places="geoId/06 geoId/11 geoId/12"
        publish="dc-bar"
        variables="LifeExpectancy_Person Median_Age_Person"
      ></datacommons-slider>
    </div>
  </datacommons-bar>
```
<img width="1070" alt="Screenshot 2024-07-26 at 4 09 48 PM"
src="https://github.com/user-attachments/assets/ebd78c05-6b21-464e-b8b9-7a407d80e616">


## Error display update:


![MG3mrvpTrZLmP2V](https://github.com/user-attachments/assets/c4f9fbe4-f649-4903-8805-3a79dfa3f703)

![76Vd74zssCVhEcR](https://github.com/user-attachments/assets/e11658a2-0a5e-4a49-b270-331bda271ec7)

* updated golden tests (#4535)

* Updated mixer and import submodules. (#4536)

* Update cache key for stat var search POSTs (#4537)

Quick fix to follow up
https://github.com/datacommonsorg/website/pull/4534 and unblock the
website release. For the future it would be nice to wrap @cache.cached
in a custom decorator that we can use everywhere which takes care of
passing default params and making sure post body is in the cache key.
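
A sketch of the key-building part of that idea. The function name is hypothetical; Flask-Caching's `cached` decorator accepts a `make_cache_key` callable, which is where something like this could plug in, though how the actual fix wires it up is not shown here:

```python
import hashlib
import json


def post_cache_key(path: str, body: dict) -> str:
    """Build a cache key from the request path plus a digest of the POST body."""
    # Sort keys so logically equal bodies produce the same key.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{path}:{digest}"
```

Hashing the body keeps keys short even when the request carries many places, which is the case that motivated moving the params into the body.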

* Fix misplaced comment. (#4538)

* Update submods. (#4540)

* remove world from mortality topic descriptions (#4541)

"world" was being overindexed in queries like "global population". 

This is a short term fix. Longer term fix will be to have a set of
negative variables that require a higher score threshold to be returned

sv diffs:
https://storage.mtls.cloud.google.com/datcom-embedding-diffs/chejennifer_base_uae_mem_2024_08_06_16_04_57.html

* Create initial multi stage CDC services docker image. (#4542)

* Added custom data commons docker-based local development environment (#4543)

Start the docker environment by running:

```
./run_cdc_dev_docker.sh
```

Open http://localhost:8080 in the browser

Changes:
- Replaced `USE_LOCAL_MIXER` environment variable with
`WEBSITE_MIXER_API_ROOT` to specify the specific path of the local
mixer.
- Added `NL_SERVICE_ROOT_URL` optional environment variable to specify
NL service path for the website
- Updated NL app to serve on `0.0.0.0` instead of `127.0.0.1` to allow
docker to expose NL service to other containers
- Updated nl_requirements: `pandas` to `2.1.1` and `scikit-learn` to
`1.3.1` because both of these versions come with pre-built wheels for
python 3.11+ ([pandas wheels](https://www.piwheels.org/project/pandas/),
[scikit-learn wheels](https://www.piwheels.org/project/scikit-learn/))

* Update redirects.json (#4539)

Adding two redirects, datacommons.org/link/video and
datacommons.org/link/form, for the two-pager

Co-authored-by: Dan Noble <[email protected]>

* Update RIG tooltips to match latest mocks (#4544)

- Incorporate footnote content into tooltip content and don't show
footnotes in answer body
- Also show tooltips when no DC stat is present


https://github.com/user-attachments/assets/0a14aca2-615a-4e8a-a261-7e61b316948c

* Use async when loading Google maps APIs (#4545)

This PR adds `loading=async` to the call to load the Google Maps API, as
per [Google's
Documentation](https://developers.google.com/maps/documentation/javascript/overview#Loading_the_Maps_API).

This removes the following console warning from pages with maps calls:

![Screenshot 2024-08-08 at 9 50
54 AM](https://github.com/user-attachments/assets/4e4ed399-c51d-4f44-80cc-3bc61a54ddbb)

* Update RIG UI to show footnotes if no tooltips (#4546)

If there are no tooltips shown, go back to old UI of showing footnotes

<img width="2547" alt="Screenshot 2024-08-08 at 3 45 40 PM"
src="https://github.com/user-attachments/assets/9ee09deb-cff5-4192-ada9-b3a395831db3">

* Create smaller, faster building services docker. (#4547)

* Create scripts to auto build / deploy the new services docker. (#4548)

* Copy only static dist artifacts to further reduce docker image size. (#4549)

* Update internal to use remoteMixerDomain (#4550)

Verified on https://dc.corp.goog/version

* Build cdc services image with docker buildkit enabled. (#4551)

* Run chmod as a separate step when building docker image. (#4553)

* Fixed local cdc docker compose setup to use custom embeddings and configured a local data path (#4555)

Fixes when running `./run_cdc_dev_docker.sh` :

- Updated FLASK_ENV from local to custom to show the custom_dc homepage
- Added IS_CUSTOM_DC true to load only custom dc embeddings
- Added `OUTPUT_DIR` to env configuration, which:
  - Loads the custom topic cache in dc-website
  - Configures the `ADDITIONAL_CATALOG_PATH` (`custom_catalog.yaml`) which
    contains custom dc embeddings and index definitions
  - Mounts local sqlite database for the `dc-mixer`

* Added custom data commons terraform deployment scripts (#4552)

Introduces a Terraform-based deployment framework for setting up a
custom Data Commons instance on GCP. The deployment automates the
creation of necessary infrastructure, including Cloud Run services,
Redis, and MySQL instances, and provisions essential API keys and
secrets. It supports multiple instances in a single GCP account using
namespaces and Terraform workspaces.

## Features:

- Automates the deployment of a Data Commons website and data task
containers via Cloud Run.
- Optionally provisions a Redis instance for caching.
- Creates a MySQL instance with a generated password stored securely in
Secret Manager.
- Automatically enables required Google Cloud APIs.
- Supports multiple deployments in the same GCP project using
namespaces.


## How to Use/Deploy:

- Follow instructions in `deploy/terraform-custom-datacommons/README.md`

* Add query logging for Bard instance (#4556)

* Create initial client to fetch dc api keys for a given project. (#4557)

* minor readme and .gitignore updates to custom dc terraform (#4558)

Fixed some typos in the custom datacommons terraform readme, and added
the backend.tf to gitignore

* Update magic_eye to use remoteMixerDomain (#4559)

Verified at https://datcom-magiceye-dev.corp.goog/version

* Add apigee apis for importing keys. (#4560)

* Read projects from Google Sheet and write dc keys back to it. (#4562)

* Updated terraform.tfvars.sample with dc_api_key (#4564)

* Updated comments in terraform.tfvars.sample (#4565)

* Added disableEntityLink option to bar chart web component (#4563)

(UN ask)

Adds `disableEntityLink` option to bar chart web components to remove
the entity link from x-axis:


![9Yy4WfAEkGGD5PY](https://github.com/user-attachments/assets/208c3066-9d19-4c92-908f-c8ef85d007e3)

* Update README.md (#4567)

* updated custom dc terraform services container to use the new image (#4570)

* Add one more step to instructions (#4569)

Co-authored-by: Dan Noble <[email protected]>

* Add support for importing keys into apigee. (#4568)

* Set default value for FLASK_ENV. (#4576)

* updated custom dc docker dev image to pull down models in container build (#4575)

* Bump github.com/hashicorp/go-getter from 1.7.0 to 1.7.5 in /deploy/terraform-datacommons-website/test (#4397)

Bumps
[github.com/hashicorp/go-getter](https://github.com/hashicorp/go-getter)
from 1.7.0 to 1.7.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/hashicorp/go-getter/releases">github.com/hashicorp/go-getter's
releases</a>.</em></p>
<blockquote>
<h2>v1.7.5</h2>
<h2>What's Changed</h2>
<ul>
<li>Prevent Git Config Alteration on Git Update by <a
href="https://github.com/dduzgun-security"><code>@​dduzgun-security</code></a>
in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/497">hashicorp/go-getter#497</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/dduzgun-security"><code>@​dduzgun-security</code></a>
made their first contribution in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/497">hashicorp/go-getter#497</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/hashicorp/go-getter/compare/v1.7.4...v1.7.5">https://github.com/hashicorp/go-getter/compare/v1.7.4...v1.7.5</a></p>
<h2>v1.7.4</h2>
<h2>What's Changed</h2>
<ul>
<li>Escape user-provided strings in <code>git</code> commands <a
href="https://redirect.github.com/hashicorp/go-getter/pull/483">hashicorp/go-getter#483</a></li>
<li>Fixed a bug in <code>.netrc</code> handling if the file does not
exist <a
href="https://redirect.github.com/hashicorp/go-getter/pull/433">hashicorp/go-getter#433</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/hashicorp/go-getter/compare/v1.7.3...v1.7.4">https://github.com/hashicorp/go-getter/compare/v1.7.3...v1.7.4</a></p>
<h2>v1.7.3</h2>
<h2>What's Changed</h2>
<ul>
<li>SEC-090: Automated trusted workflow pinning (2023-04-21) by <a
href="https://github.com/hashicorp-tsccr"><code>@​hashicorp-tsccr</code></a>
in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/432">hashicorp/go-getter#432</a></li>
<li>SEC-090: Automated trusted workflow pinning (2023-09-11) by <a
href="https://github.com/hashicorp-tsccr"><code>@​hashicorp-tsccr</code></a>
in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/454">hashicorp/go-getter#454</a></li>
<li>SEC-090: Automated trusted workflow pinning (2023-09-18) by <a
href="https://github.com/hashicorp-tsccr"><code>@​hashicorp-tsccr</code></a>
in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/458">hashicorp/go-getter#458</a></li>
<li>don't change GIT_SSH_COMMAND when there is no sshKeyFile by <a
href="https://github.com/jbardin"><code>@​jbardin</code></a> in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/459">hashicorp/go-getter#459</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/hashicorp-tsccr"><code>@​hashicorp-tsccr</code></a>
made their first contribution in <a
href="https://redirect.github.com/hashicorp/go-getter/pull/432">hashicorp/go-getter#432</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/hashicorp/go-getter/compare/v1.7.2...v1.7.3">https://github.com/hashicorp/go-getter/compare/v1.7.2...v1.7.3</a></p>
<h2>v1.7.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Don't override <code>GIT_SSH_COMMAND</code> when not needed by <a
href="https://github.com/nl-brett-stime"><code>@​nl-brett-stime</code></a>
<a
href="https://redirect.github.com/hashicorp/go-getter/pull/300">hashicorp/go-getter#300</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/hashicorp/go-getter/compare/v1.7.1...v1.7.2">https://github.com/hashicorp/go-getter/compare/v1.7.1...v1.7.2</a></p>
<h2>v1.7.1</h2>
<p>No release notes provided.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/hashicorp/go-getter/commit/5a63fd9c0d5b8da8a6805e8c283f46f0dacb30b3"><code>5a63fd9</code></a>
Merge pull request <a
href="https://redirect.github.com/hashicorp/go-getter/issues/497">#497</a>
from hashicorp/fix-git-update</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/5b7ec5f039197dd363e912c8367329f8399557c6"><code>5b7ec5f</code></a>
fetch tags on update and fix tests</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/9906874a23919a81eff097d84fdb8f98525ac880"><code>9906874</code></a>
recreate git config during update to prevent config alteration</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/268c11cae8cf0d9374783e06572679796abe9ce9"><code>268c11c</code></a>
escape user provide string to git (<a
href="https://redirect.github.com/hashicorp/go-getter/issues/483">#483</a>)</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/975961f5f06346ccc282cd0d9aa16e160d26f9e3"><code>975961f</code></a>
Merge pull request <a
href="https://redirect.github.com/hashicorp/go-getter/issues/433">#433</a>
from adrian-bl/netrc-fix</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/0298a221674f629339295fa8a1e6a938e28506e0"><code>0298a22</code></a>
Merge pull request <a
href="https://redirect.github.com/hashicorp/go-getter/issues/459">#459</a>
from hashicorp/jbardin/setup-git-env</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/c70d9c915b8e823c44dd591088d15cde70d5e813"><code>c70d9c9</code></a>
don't change GIT_SSH_COMMAND if there's no keyfile</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/3d5770fe3ae127b90f54d825ef1772f0b4e86621"><code>3d5770f</code></a>
Merge pull request <a
href="https://redirect.github.com/hashicorp/go-getter/issues/458">#458</a>
from hashicorp/tsccr-auto-pinning/trusted/2023-09-18</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/06889794ed3f360b24e8ef7169294ccc59abc044"><code>0688979</code></a>
Result of tsccr-helper -log-level=info -pin-all-workflows .</li>
<li><a
href="https://github.com/hashicorp/go-getter/commit/e66f244d9206aca1ce0dee4823c833fecb2f77fc"><code>e66f244</code></a>
Merge pull request <a
href="https://redirect.github.com/hashicorp/go-getter/issues/454">#454</a>
from hashicorp/tsccr-auto-pinning/trusted/2023-09-11</li>
<li>Additional commits viewable in <a
href="https://github.com/hashicorp/go-getter/compare/v1.7.0...v1.7.5">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=github.com/hashicorp/go-getter&package-manager=go_modules&previous-version=1.7.0&new-version=1.7.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Dan Noble <[email protected]>

* [custom_dc] Add env.list.sample (#4577)

* Removed 'Analyze this data in BigQuery' button from visualization tools (#4582)

## Before

![before1](https://github.com/user-attachments/assets/5757640c-4d90-44bb-825d-10c564c750cd)

![after2](https://github.com/user-attachments/assets/c57368f5-40db-4ef3-9f9f-0b1220d7924f)

## After

![after1](https://github.com/user-attachments/assets/6aa46825-8d27-4623-931b-6e79592f30ba)

![before2](https://github.com/user-attachments/assets/8256c28c-b683-4973-b839-21580e0b6b66)

* Added terraform variable google_analytics_tag_id for enabling Google Analytics (#4572)

- Added terraform variable `google_analytics_tag_id` for enabling Google
Analytics
- The Google Analytics Tag ID can now be passed as an environment
variable to Custom Data Commons. Previously, users had to set the
variable in their `server/app_env/*.py` config file

* Enable NL by default for Custom DC. (#4583)

* [docs] Update readme with setup instructions for tests. (#4578)

Make it clear that devs have to run setup before running tests.

Fixes #3923

* Disable obs browser pages for some custom DCs (#4566)

Updates:
* climate_trace
* custom
* feedingamerica
* iitm
* unsdg

This removes the dependency on BQ (which is currently broken for many
instances), so we don't have to keep as many BQ versions.

Verified locally for each instance, but I want to double-check that this
won't negatively impact any prod instance.

* Updating goldens for nodejs query differ (#4585)

Responding to the alerts on autopush

* Add semicolon to server script (#4571)

* Remove scripts and docker files related to old CDC docker image. (#4590)

* Stanford Upload 8/24 (#4586)

Adding Sustainable Systems Lab upload to Stanford env

---------

Co-authored-by: Bo Xu <[email protected]>
Co-authored-by: Carolyn Au <[email protected]>

* Reduced UN prod cluster resources (#4508)

Metrics links:
- [Requests per
second](https://console.cloud.google.com/monitoring/metrics-explorer;duration=P14D?pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22dataSets%22:%5B%7B%22plotType%22:%22LINE%22,%22targetAxis%22:%22Y1%22,%22timeSeriesFilter%22:%7B%22aggregations%22:%5B%7B%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22groupByFields%22:%5B%22resource.label.%5C%22url_map_name%5C%22%22%5D,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%5D,%22apiSource%22:%22DEFAULT_CLOUD%22,%22crossSeriesReducer%22:%22REDUCE_SUM%22,%22filter%22:%22metric.type%3D%5C%22loadbalancing.googleapis.com%2Fhttps%2Frequest_count%5C%22%20resource.type%3D%5C%22https_lb_rule%5C%22%20resource.label.%5C%22project_id%5C%22%3D%5C%22datcom-recon-autopush%5C%22%20resource.label.%5C%22url_map_name%5C%22%3D%5C%22k8s2-um-35faua5t-website-website-ingress-2uz3r82p%5C%22%22,%22groupByFields%22:%5B%22resource.label.%5C%22url_map_name%5C%22%22%5D,%22minAlignmentPeriod%22:%2260s%22,%22perSeriesAligner%22:%22ALIGN_RATE%22%7D%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22y1Axis%22:%7B%22label%22:%22%22,%22scale%22:%22LINEAR%22%7D%7D%7D&project=datcom-recon-autopush&e=13803378&hl=en&inv=1&invt=AbXICg&mods=-monitoring_api_staging)
-
[Latency](https://console.cloud.google.com/monitoring/metrics-explorer;startTime=2024-07-09T04:02:03Z;endTime=2024-07-16T04:02:03Z?pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22d…