This is a script to download emails from Gmail and store them in a SQLite database for further analysis. I find it extremely useful to have all my emails in a database to run queries on them. For example, I can find out how many emails I received per sender, which emails take the most space, and which emails from which sender I never read.
- Clone this repository:
git clone https://github.com/marcboeker/gmail-to-sqlite.git
. - Install the requirements:
pip install -r requirements.txt
- Create a Google Cloud project here.
- Open Gmail in API & Services and activate the Gmail API.
- Open the OAuth consent screen and create a new consent screen. You only need to provide a name and contact data.
- Next open Create OAuth client ID and create credentials for a
Desktop app
. Download the credentials file and save it undercredentials.json
in the root of this repository.
Here is a detailed guide on how to create the credentials: https://developers.google.com/gmail/api/quickstart/python#set_up_your_environment.
- Run the script:
python main.py sync --data-dir path/to/your/data
where--<data-dir>
is the path where all data is stored. This creates a SQLite database in<data-dir>/messages.db
and stores the user credentials under<data-dir>/credentials.json
. - After the script has finished, you can query the database using, for example, the
sqlite3
command line tool:sqlite3 <data-dir>/messages.db
. - You can run the script again to sync all new messages. Provide
--full-sync
to force a full sync. However, this will only update the read status, the labels, and the last indexed timestamp for existing messages.
python main.py sync-message --data-dir path/to/your/data --message-id <message-id>
usage: main.py [-h] [--data-dir DATA_DIR] [--update] {sync, sync-message}
Main commands:
sync Sync emails from Gmail to the database.
sync-message Sync a single message from Gmail to the database.
--data-dir DATA_DIR Path to the directory where all data is stored.
--full-sync Force a full sync.
--message-id MESSAGE_ID Sync only the message with the given message id.
CREATE TABLE IF NOT EXISTS "messages" (
"id" INTEGER NOT NULL PRIMARY KEY, -- internal id
"message_id" TEXT NOT NULL, -- Gmail message id
"thread_id" TEXT NOT NULL, -- Gmail thread id
"sender" JSON NOT NULL, -- Sender as JSON in the form {"name": "Foo Bar", "email": "[email protected]"}
"recipients" JSON NOT NULL, -- JSON object: {
-- "to": [{"email": "[email protected]", "name": "Foo Bar"}, ...],
-- "cc": [{"email": "[email protected]", "name": "Foo Bar"}, ...],
-- "bcc": [{"email": "[email protected]", "name": "Foo Bar"}, ...]
--}
"labels" JSON NOT NULL, -- JSON array: ["INBOX", "UNREAD", ...]
"subject" TEXT NOT NULL, -- Subject of the email
"body" TEXT NOT NULL, -- Extracted body either als HTML or plain text
"size" INTEGER NOT NULL, -- Size reported by Gmail
"timestamp" DATETIME NOT NULL, -- When the email was sent/received
"is_read" INTEGER NOT NULL, -- 0=Unread, 1=Read
"is_outgoing" INTEGER NOT NULL, -- 0=Incoming, 1=Outgoing
"last_indexed" DATETIME NOT NULL -- Timestamp when the email was last seen on the server
);
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
GROUP BY sender->>'$.email'
ORDER BY count DESC
This is great to determine who is spamming you the most with uninteresting emails.
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE is_read = 0
GROUP BY sender->>'$.email'
ORDER BY count DESC
- For years:
strftime('%Y', timestamp)
- For months in a year:
strftime('%m', timestamp)
- For days in a month:
strftime('%d', timestamp)
- For weekdays:
strftime('%w', timestamp)
- For hours in a day:
strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count
FROM messages
GROUP BY period
ORDER BY count DESC
This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%'
GROUP BY sender->>'$.email'
ORDER BY count DESC
SELECT sender->>'$.email', sum(size)/1024/1024 AS size
FROM messages
GROUP BY sender->>'$.email'
ORDER BY size DESC
SELECT count(*)
FROM messages
WHERE EXISTS (
SELECT 1
FROM json_each(messages.recipients->'$.to')
WHERE json_extract(value, '$.email') = '[email protected]'
)
AND sender->>'$.email' = '[email protected]'
SELECT sender->>'$.email', sum(size)/1024/1024 as total_size
FROM messages
WHERE is_outgoing=false
GROUP BY sender->>'$.email'
ORDER BY total_size DESC
- Detect deleted emails and mark them as deleted in the database.