Indexing only recent data - adventures with large datasets & archiving
+by Oren Eini
+posted on: July 26, 2024
+We recently got a support request from a user in which they had the following issue:
+We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?
+from d in docs.Events
+where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
+select new { d.CreationDate, d.Content };
+The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained. The actual problem is that this is a full-text search index, and the actual data size required to perform a full-text search across the entire dataset is higher than just storing the documents (which can be easily compressed).
+This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is actually something that we explicitly prevent. The problem is that the current date isn’t really meaningful when we talk about indexing. The indexing time isn’t really relevant for filtering or operations, since it has no association with the actual data.
+The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That would filter it out from the index. However, if we didn’t update the document, it would be retained indefinitely, since the filtering occurs only at indexing time.
+Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?
+Data Archiving in RavenDB
+One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.
+When you save a document, all you need to do is add a metadata element: @archive-at and you are set. For example, take a look at the following document:
+{
+"Name": "Wilman Kal",
+"Phone": "90-224 8888",
+"@metadata": {
+"@archive-at": "2024-11-01T12:00:00.000Z",
+"@collection": "Companies",
+}
+}
+This document is set to be archived on Nov 1st, 2024. What does that mean? From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.
+In fact, this exact scenario is detailed in the documentation. You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.
+In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.
Cryptographically impossible bug hunt
by Oren Eini
@@ -207,18 +219,6 @@ posted on: July 22, 2024
Learn how to integrate AI into your .NET applications with Prompty, a powerful Visual Studio Code extension.
-Introducing CoreWCF and WCF Client Azure Queue Storage bindings for .NET
-by Subhrajit Saha
-posted on: July 18, 2024
-The initial beta release of the official libraries Microsoft.CoreWCF.Azure.StorageQueues and Microsoft.WCF.Azure.StorageQueues.Client library for .NET is now available.
Temporal cattle and other important jargon
by Oren Eini
diff --git a/site/index.html b/site/index.html
index 39f515890..5718e7f79 100644
--- a/site/index.html
+++ b/site/index.html
@@ -147,6 +147,18 @@
by Mehul Harry
+
+
+
+ posted on: August 14, 2024
+ Announcing .NET Conf 2024 - a free, three-day virtual developer event that celebrates the release of .NET 9.
by Luis Quintanilla
@@ -255,18 +267,6 @@
posted on: July 29, 2024
Learn how to get started creating bindings with Native Library Interop by following this example binding native Chart libraries in a .NET MAUI application.
by Oren Eini
-
-
-
- posted on: July 26, 2024
- We recently got a support request from a user in which they had the following issue:
We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?
from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };
The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained. The actual problem is that this is a full-text search index, and the actual data size required to perform a full-text search across the entire dataset is higher than just storing the documents (which can be easily compressed).
This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is actually something that we explicitly prevent. The problem is that the current date isn’t really meaningful when we talk about indexing. The indexing time isn’t really relevant for filtering or operations, since it has no association with the actual data.
The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past. That would filter it out from the index. However, if we didn’t update the document, it would be retained indefinitely, since the filtering occurs only at indexing time.
Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?
Data Archiving in RavenDB
One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.
When you save a document, all you need to do is add a metadata element: @archive-at and you are set. For example, take a look at the following document:
{
"Name": "Wilman Kal",
"Phone": "90-224 8888",
"@metadata": {
"@archive-at": "2024-11-01T12:00:00.000Z",
"@collection": "Companies",
}
}
This document is set to be archived on Nov 1st, 2024. What does that mean? From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.
In fact, this exact scenario is detailed in the documentation. You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.
In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.
.NET Conf 2024 – Celebrating the Release of .NET 9! – Save the Date!
+ Introducing the Azure AI Inference SDK: Access More AI Models with the Azure AI Model Catalog
Indexing only recent data - adventures with large datasets & archiving
-