Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More realistic example query plus additional explanantions. #2803

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file added docs/en/images/column-oriented-example-query.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
31 changes: 25 additions & 6 deletions docs/en/intro.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,22 +24,41 @@ In a row-oriented database, consecutive table rows are sequentially stored one a

ClickHouse is a column-oriented databases. In such systems, tables are stored as a collection of columns, i.e. the values of each column are stored sequentially one after the other. This layout makes it harder to restore single rows (as there are now gaps between the row values) but column operations such as filters or aggregation becomes much faster than in a row-oriented database.

The difference is best explained with an example query which filters on three columns:
The difference is best explained with an example query running over 100 million rows of [real-world anonymized web analytics data](https://clickhouse.com/docs/en/getting-started/example-datasets/metrica):

```sql
SELECT *
FROM table
WHERE time > '2024-09-01 13:14:15'
AND location = ‘Berlin’
AND `Mobile Phone` LIKE '%61010%`
SELECT MobilePhoneModel, COUNT() AS c
FROM metrica.hits
WHERE
RegionID = 229
AND EventDate >= '2013-07-01'
AND EventDate <= '2013-07-31'
AND MobilePhone != 0
AND MobilePhoneModel not in ['', 'iPad', 'iPhone']
GROUP BY RegionID, MobilePhone, MobilePhoneModel
ORDER BY c DESC
LIMIT 5;
```

You can click [here](https://sql.clickhouse.com?query=U0VMRUNUIE1vYmlsZVBob25lTW9kZWwsIENPVU5UKCkgQVMgYyAKRlJPTSBtZXRyaWNhLmhpdHMgCldIRVJFIAogICAgICBSZWdpb25JRCA9IDIyOSAKICBBTkQgRXZlbnREYXRlID49ICcyMDEzLTA3LTAxJyAKICBBTkQgRXZlbnREYXRlIDw9ICcyMDEzLTA3LTMxJyAKICBBTkQgTW9iaWxlUGhvbmUgIT0gMCAKICBBTkQgTW9iaWxlUGhvbmVNb2RlbCBub3QgaW4gWycnLCAnaVBhZCcsICdpUGhvbmUnXSAKR1JPVVAgQlkgUmVnaW9uSUQsIE1vYmlsZVBob25lLCBNb2JpbGVQaG9uZU1vZGVsCk9SREVSIEJZIGMgREVTQyAKTElNSVQgNTs&chart=eyJ0eXBlIjoicGllIiwiY29uZmlnIjp7InhheGlzIjoiTW9iaWxlUGhvbmVNb2RlbCIsInlheGlzIjoiYyJ9fQ&run_query=true) to run this query that selects and filters just a few out of [over 100](https://sql.clickhouse.com/?query=U0VMRUNUIG5hbWUKRlJPTSBzeXN0ZW0uY29sdW1ucwpXSEVSRSBkYXRhYmFzZSA9ICdtZXRyaWNhJyBBTkQgdGFibGUgPSAnaGl0cyc7&tab=results&run_query=true) existing columns, returning the result within milliseconds:

![Row-oriented](@site/docs/en/images/column-oriented-example-query.png#)




**Row-oriented DBMS**

In a row-oriented database, even though the query above only processes a few out of the existing columns, the system still needs to load the data from other existing columns from disk to memory. The reason for that is that data is stored on disk in chunks called [blocks](https://en.wikipedia.org/wiki/Block_(data_storage)) (usually fixed sizes, e.g., 4 KB or 8 KB). Blocks are the smallest units of data read from disk to memory. When an application or database requests data, the operating system’s disk I/O subsystem reads the required blocks from the disk. Even if only part of a block is needed, the entire block is read into memory (this is due to disk and file system design):


![Row-oriented](@site/docs/en/images/row-oriented.gif#)

**Column-oriented DBMS**

Because the values of each column are stored sequentially one after the other on disk, no unnecessary data is loaded when the query from above is run.
Because the block-wise storage and transfer from disk to memory is aligned with the data access pattern of analytical queries, only the columns required for a query are read from disk, avoiding unnecessary I/O for unused data. This is much faster compared to row-based storage, where entire rows (including irrelevant columns) are read:

![Column-oriented](@site/docs/en/images/column-oriented.gif#)

## Data Replication and Integrity
Expand Down