Pandas to Postgres -> My own method is a lot faster than odo (or I made a mistake) #614
Comments
I was not able to use
In case you are looking for an alternative, you might want to consider d6tstack. You can preprocess CSVs and then export to CSV, Parquet or SQL without having to write custom functions. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in. It also benchmarks well.
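For illustration, a rough sketch of the multi-CSV-to-Postgres path (assuming d6tstack's CombinerCSV and to_psql_combine interface; the connection string and file paths are placeholders):

    import glob

    import d6tstack.combine_csv

    # Placeholder connection string and CSV file pattern
    uri = 'postgresql+psycopg2://user:password@localhost:5432/mydb'
    csv_files = glob.glob('data/*.csv')

    # Combine multiple CSVs (tolerating added/removed columns across files)
    # and load the result into Postgres without a hand-written loader.
    combiner = d6tstack.combine_csv.CombinerCSV(csv_files)
    combiner.to_psql_combine(uri, 'my_table')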
Thanks!!!
Hi, could you help me understand what I should pass in for schema?
@SandervandenOord thanks for sharing this technique! I found it really valuable for a project I'm working on.

Some of my schema names needed to be quoted in the SQL statement (long story 🙄), so I had to change this line:

    schema_tablename = '{}.{}'.format(schema, table_name)

to this:

    schema_tablename = '"{}"."{}"'.format(schema, table_name)

Here's a minor revision that uses cursor.copy_expert instead:

    from io import StringIO

    def df_to_database(engine, df, schema, table_name, if_exists='replace',
                       encoding='utf-8', index=False):
        # Create the (empty) target table from the dataframe's columns
        df[:0].to_sql(table_name, engine, if_exists=if_exists, index=index, schema=schema)

        # Dump the dataframe to an in-memory CSV buffer
        output = StringIO()
        df.to_csv(output, sep=',', header=False, encoding=encoding, index=index)
        output.seek(0)

        # Stream the buffer into Postgres with COPY
        connection = engine.raw_connection()
        cursor = connection.cursor()
        schema_tablename = '"{}"."{}"'.format(schema, table_name)
        cursor.copy_expert("COPY " + schema_tablename + " FROM STDIN WITH (FORMAT CSV)", output)
        connection.commit()
        cursor.close()

Just sharing in case other people run into the same. Thank you again for the great technique!
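For context, calling the helper might look like this (the connection string and dataframe are just placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder engine and data, purely to illustrate the call
    engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
    df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

    df_to_database(engine, df, schema='public', table_name='my_table')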
@wolfc86 You forgot to add

    def df_to_database(engine, df, schema, table_name, if_exists='replace', encoding='utf-8', index=False):
        [...]
Ah, right you are @mairanteodoro! I updated the snippet in my comment. Thanks!
@SandervandenOord Thank you for this amazing technique! I had to push over 4 GB of data (mostly text fields with conversations) to Postgres, and this made my task much simpler. The single bulk insert took 30 minutes to run (actual wall time: 30min 35s). I then modified @wolfc86's code to add the chunksize option in df.to_csv(...), split the dataframe into chunks, created the header outside the method, and pushed the data into the db in small batches instead of one bulk insert. That took 8 minutes to execute. Revised snippet:

Thanks again!
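A minimal sketch of that chunked approach, assuming the same StringIO/COPY pattern as above (the chunk size, connection string, and dataframe are illustrative placeholders, not the original snippet):

    from io import StringIO

    import pandas as pd
    from sqlalchemy import create_engine

    # Illustrative placeholders
    engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
    df = pd.DataFrame({'id': range(250_000), 'conversation': 'some text'})
    schema, table_name = 'public', 'conversations'
    chunk_rows = 50_000  # illustrative batch size

    # Create the empty table (the "header") once, outside the per-chunk step
    df[:0].to_sql(table_name, engine, if_exists='replace', index=False, schema=schema)

    schema_tablename = '"{}"."{}"'.format(schema, table_name)
    connection = engine.raw_connection()

    # Push the data in small batches instead of one bulk COPY
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        output = StringIO()
        chunk.to_csv(output, sep=',', header=False, index=False)
        output.seek(0)
        cursor = connection.cursor()
        cursor.copy_expert("COPY " + schema_tablename + " FROM STDIN WITH (FORMAT CSV)", output)
        connection.commit()
        cursor.close()

    connection.close()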
Works like a charm. Thanks |
I have a pandas dataframe of 8 million rows. I found a fast method on Stack Overflow to write this data to Postgres: it takes less than 2 minutes.
Today I found odo and was hoping it would be faster, but it wasn't.
In fact, my guess is it would take 30 minutes or so with odo; I stopped after 7 minutes.
Here's the code for faster writing of a dataframe to Postgres, hope it helps you guys:
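A minimal sketch of that approach, reconstructed from the pattern discussed in the comments above (the connection string, table names, and dataframe are placeholders):

    from io import StringIO

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string and data, just to illustrate the pattern
    engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
    df = pd.DataFrame({'id': range(1_000), 'value': 1.0})
    schema, table_name = 'public', 'my_table'

    # Create the empty target table from the dataframe's columns
    df[:0].to_sql(table_name, engine, if_exists='replace', index=False, schema=schema)

    # Dump the dataframe to an in-memory CSV buffer
    output = StringIO()
    df.to_csv(output, sep=',', header=False, index=False)
    output.seek(0)

    # COPY the buffer into Postgres in one shot
    connection = engine.raw_connection()
    cursor = connection.cursor()
    schema_tablename = '{}.{}'.format(schema, table_name)
    # Note: psycopg2 >= 2.9 quotes the table argument of copy_from, so a dotted
    # schema.table name no longer works there; copy_expert (as in the comments) avoids that.
    cursor.copy_from(output, schema_tablename, sep=',', null='')
    connection.commit()
    cursor.close()

Since copy_from splits fields plainly on the separator, data containing commas or newlines is safer with the copy_expert / FORMAT CSV variant shown in the comments.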