US NTSB Crash NL embeddings #4432
base: master
@@ -2370,6 +2370,10 @@ Count_Person_InLaborForce_ResidesInCollegeOrUniversityStudentHousing,Number of L
Count_Person_InLaborForce_ResidesInGroupQuarters,Number of Labor Force Participants reside in Group Quarters
Count_Person_InLaborForce_ResidesInNoninstitutionalizedGroupQuarters,Number of Labor Force Participants reside in Noninstitutionalized Group Quarters
Count_Person_IsInternetUser_PerCapita,percentage of internet users
Count_Person_InvolvedInCrash_Motorists,Number of motorists involved in crash
Count_Person_InvolvedInCrash_MotorVehicleOccupant,Number of motor vehicle occupants involved in crash
Count_Person_InvolvedInCrash_NonMotorists,Number of non-motorists involved in crash
Right now we don't add "non xxx" and "other xxx" variables to the index, since they are vague when matching a query (or can even match the opposite meaning).
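To make the rule above concrete, here is a minimal sketch of how such rows could be flagged before they reach the index. The regular expression and the helper name `split_vague_rows` are assumptions made for illustration; they are not part of the actual embeddings tooling.

```python
import csv
import io
import re

# Hypothetical heuristic: a sentence is "vague" if it contains a standalone
# "non" or "other" qualifier (e.g. "non-motorists", "other vehicles").
VAGUE_PATTERN = re.compile(r"\b(non|other)\b", re.IGNORECASE)


def split_vague_rows(csv_text):
    """Split (dcid, sentence) rows into those to keep and those to skip."""
    kept, skipped = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        dcid, sentence = row
        if VAGUE_PATTERN.search(sentence):
            skipped.append((dcid, sentence))
        else:
            kept.append((dcid, sentence))
    return kept, skipped


SAMPLE = """\
Count_Person_InvolvedInCrash_Motorists,Number of motorists involved in crash
Count_Person_InvolvedInCrash_MotorVehicleOccupant,Number of motor vehicle occupants involved in crash
Count_Person_InvolvedInCrash_NonMotorists,Number of non-motorists involved in crash
"""

kept, skipped = split_vague_rows(SAMPLE)
for dcid, _ in skipped:
    print("skip:", dcid)
```

The check looks only at the sentence text, so a description written without a separator (for example "nonmotorists") would slip through; a real filter would likely need to inspect the dcid as well.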
@@ -2760,6 +2764,10 @@ Count_ThunderstormWindEvent,Number of thunderstorm wind events
Count_TornadoEvent,Number of tornado events
Count_TropicalStormEvent,Number of tropical storm events
Count_UnemploymentInsuranceClaim_StateUnemploymentInsurance,Number of state unemployment insurance claims
Count_Vehicle_InvolvedInCrash_InTransport,Number of vehicles crashed in transport
Count_VehicleCrashIncident_NationalHighway,Number of crash incidents occurred on the national highway
"number of vehicle crash incidents..." same as below
Since it's counting the vehicles in transport, shall we use the sentence "Number of vehicles in-transport involved in crash incidents"?
Oh, I did not mean to include line 2767; that line looks good.
Count_Vehicle_InvolvedInCrash_InTransport,Number of vehicles crashed in transport
Count_VehicleCrashIncident_NationalHighway,Number of crash incidents occurred on the national highway
Count_VehicleCrashIncident_StateHighway,Number of crash incidents on state highway
Count_VehicleCrashIncident_USHighway,Number of crash incidents on US highway
"US highway" is a bit vague in meaning and may be very hard to match to an actual query. Looking at these 3 stat vars, I wonder if there is an aggregate stat var?
For example, the most likely query is "how many highway car crashes happened in 2020"; which stat vars do we want to match here?
The raw dataset we processed is at crash granularity, which we aggregated. Looking at the data, a USHighway or StateHighway may or may not be part of the National Highway Network. The US highway system seems to be an older one, as per https://en.wikipedia.org/wiki/Numbered_highways_in_the_United_States. Shall we remove USHighway from the index and retain both NationalHighway and StateHighway? Is it desirable for the question "how many highway car crashes happened in 2020" to return both the NationalHighway and StateHighway numbers?
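The ambiguity discussed in this thread can be illustrated with a toy lexical score. Note the swap: the real index uses sentence embeddings, whereas this sketch uses plain Jaccard token overlap as a stand-in, so the numbers are illustrative only. The helper names `tokens` and `jaccard` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(query, sentence):
    """Jaccard similarity between the token sets of two strings."""
    q, s = tokens(query), tokens(sentence)
    return len(q & s) / len(q | s)


# Sentences as proposed in the diff under review.
SENTENCES = {
    "Count_VehicleCrashIncident_NationalHighway":
        "Number of crash incidents occurred on the national highway",
    "Count_VehicleCrashIncident_StateHighway":
        "Number of crash incidents on state highway",
    "Count_VehicleCrashIncident_USHighway":
        "Number of crash incidents on US highway",
}

QUERY = "how many highway car crashes happened in 2020"

scores = {dcid: jaccard(QUERY, s) for dcid, s in SENTENCES.items()}
for dcid, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {dcid}")
```

Under this stand-in, the only token the query shares with any of the three sentences is "highway", so nothing in the query separates the State and US variants. That is the same concern raised above about which stat vars a generic highway query should match.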
How about we add these to a new CSV file? Say, one named by date, like 2024_q3.csv?
Added the NL sentences to a new CSV file and modified the NL sentences.
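As a rough sketch of what such a file might look like, the snippet below writes a few of the sentences from this PR under the `dcid,sentence` header that appears in the new file's diff. The helper `write_sentences` is hypothetical, and an in-memory buffer stands in for the real file so the example has no side effects.

```python
import csv
import io


def write_sentences(rows, out):
    """Write (dcid, sentence) pairs as CSV with a dcid,sentence header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["dcid", "sentence"])
    for dcid, sentence in sorted(rows):  # sorted for stable diffs
        writer.writerow([dcid, sentence])


ROWS = [
    ("Count_Person_InvolvedInCrash_Motorists",
     "Number of motorists involved in crash"),
    ("Count_Person_InvolvedInCrash_MotorVehicleOccupant",
     "Number of motor vehicle occupants involved in crash"),
    ("Count_Vehicle_InvolvedInCrash_InTransport",
     "Number of vehicles in motion involved in crash"),
]

buf = io.StringIO()
write_sentences(ROWS, buf)
print(buf.getvalue())
```

To write to disk instead, one would pass a file opened with `open(path, "w", newline="")` in place of the buffer.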
@@ -4,37 +4,6 @@
"categories": [
{
"blocks": [
{
There should not be a deletion here.
- If you sync the branch, make sure to rebuild the embeddings.
- You can re-run run_test.sh -g to see if this gets fixed.
Count_Person_InvolvedInCrash_Motorists,Number of motorists involved in crash
Count_Person_InvolvedInCrash_MotorVehicleOccupant,Number of motor vehicle occupants involved in crash
Count_Vehicle_InvolvedInCrash_InTransport,Number of vehicles in motion involved in crash
Count_VehicleCrashIncident_CollisionCrash_HeadOnCollision,Number of headon vehicle collisions
head-on
dcid,sentence
Count_Person_InvolvedInCrash_Motorists,Number of motorists involved in crash
Count_Person_InvolvedInCrash_MotorVehicleOccupant,Number of motor vehicle occupants involved in crash
Count_Vehicle_InvolvedInCrash_InTransport,Number of vehicles in motion involved in crash
I see, so "in transport" does not mean being transported from place A to B.
How about just saying "moving vehicles"?
Count_Person_InvolvedInCrash_Motorists,Number of motorists involved in crash
Count_Person_InvolvedInCrash_MotorVehicleOccupant,Number of motor vehicle occupants involved in crash
Count_Vehicle_InvolvedInCrash_InTransport,Number of vehicles in motion involved in crash
Count_VehicleCrashIncident_CollisionCrash_HeadOnCollision,Number of headon vehicle collisions
A side question: is there a distinction between "collision" and "crash" in the schema? And shall we unify or distinguish them in the descriptions?
As per the definition, a crash seems to be more severe, involving damage or fatalities, so we will stick with "crash" for all.
@hareesh-ms - sorry for the delay. Is it ready for a review? I had some of the same comments as Bo.
NL embeddings for US NTSB crash dataset.
Diff: https://storage.mtls.cloud.google.com/datcom-embedding-diffs/hareeshms_base_uae_mem_2024_07_02_11_49_29.html