Aliases are duplicated #28

Open
cristan opened this issue Sep 30, 2024 · 2 comments

cristan commented Sep 30, 2024

Check out https://github.com/datasets/un-locode/blob/main/data/alias.csv

Let's take the first line:

GL,Christianshaab = Qasigiannguit (Christianshaab),Christianshaab = Qasigiannguit (Christianshaab)

That exact line appears twice (again at line 88). The same is true of every line I've checked.
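A quick way to check how widespread the duplication is, as a rough sketch using only the standard library. The helper name `find_duplicate_rows` is made up for illustration, the file is assumed to start with a header row, and the inline sample (including its invented second row) merely stands in for the real CSV.

```python
import csv
import io
from collections import Counter

def find_duplicate_rows(csv_file):
    """Return {row_tuple: count} for every data row that occurs more than once."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    # Count identical rows; blank lines are ignored.
    counts = Counter(tuple(row) for row in reader if row)
    return {row: n for row, n in counts.items() if n > 1}

# Inline sample standing in for data/alias.csv.
# The "XX" row is made up purely to have one non-duplicated entry.
alias = "Christianshaab = Qasigiannguit (Christianshaab)"
sample = io.StringIO(
    "Country,Name,NameWoDiacritics\n"
    f"GL,{alias},{alias}\n"
    "XX,Old Name = New Name,Old Name = New Name\n"
    f"GL,{alias},{alias}\n"
)

duplicates = find_duplicate_rows(sample)
for row, n in duplicates.items():
    print(n, row)
```

To run it against the real file, pass `open("data/alias.csv", newline="", encoding="utf-8")` instead of the in-memory sample.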

sabas (Contributor) commented Oct 7, 2024

ChatGPT suggests simply dropping the duplicates :D
Will look into it after the other PRs are discussed (@gradedSystem)

# unlocode_df is assumed to be the full UN/LOCODE pandas DataFrame, already loaded.
# Alias rows have an empty Location and '=' in the Change column.
no_location = unlocode_df['Location'].isna() | (unlocode_df['Location'] == '')
is_alias = no_location & (unlocode_df['Change'] == '=')

# Keep only the alias columns and drop exact duplicate rows
alias_columns = ['Country', 'Name', 'NameWoDiacritics']
alias_df = unlocode_df.loc[is_alias, alias_columns].drop_duplicates()

# Save the alias DataFrame to CSV
alias_df.to_csv("data/alias.csv", index=False)

gradedSystem (Member) commented Oct 7, 2024

@sabas what if we just did something like this (using a simple regex):

GL,Christianshaab, Qasigiannguit

wdyt?
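One possible reading of the regex idea, as a sketch only: split each `alias = name (alias)` value into its two names. The pattern, the `split_alias` helper, and the assumption that every alias value follows this shape are illustrative guesses, not something confirmed in this thread.

```python
import re

# Assumed shape: "<alias> = <current name>", optionally followed by "(<something>)".
ALIAS_PATTERN = re.compile(
    r"^\s*(?P<alias>.+?)\s*=\s*(?P<name>.+?)\s*(?:\((?P<paren>[^()]*)\))?\s*$"
)

def split_alias(value):
    """Split 'Old = New (Old)' into ('Old', 'New'); return None if there is no '='."""
    match = ALIAS_PATTERN.match(value)
    if match is None:
        return None
    alias, name, paren = match.group("alias", "name", "paren")
    # Only drop the trailing parenthetical when it just repeats the alias;
    # otherwise treat it as part of the current name and put it back.
    if paren is not None and paren.strip() != alias:
        name = f"{name} ({paren})"
    return alias, name

row = ("GL", "Christianshaab = Qasigiannguit (Christianshaab)")
alias, name = split_alias(row[1])
print(",".join([row[0], alias, name]))  # GL,Christianshaab,Qasigiannguit
```

Values that do not match the assumed shape come back as `None`, so they can be logged and reviewed by hand rather than silently rewritten.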
