Partial/incremental decompression of clusters #411
Codecov Report
@@            Coverage Diff             @@
##           master     #411      +/-   ##
==========================================
+ Coverage   47.17%   48.37%   +1.20%
==========================================
  Files          66       71       +5
  Lines        3305     3390      +85
  Branches     1422     1442      +20
==========================================
+ Hits         1559     1640      +81
- Misses       1746     1750       +4
Continue to review full report at Codecov.
Force-pushed from b850bfd to 445424d
That is a big change (I've just quickly read it, not made a review). But as said in #395 (comment), it would have been nice to have some measurements to know where we are going, instead of making such a big change for potentially nothing (or worse). Can you explain the global strategy for this PR?
This PR is only about partial/incremental decompression (it doesn't include caching at the article/item level, though it is, of course, an advance in that direction). As a result of this change, clusters are decompressed lazily. Some minor sources of slowdown are the following:
I am confident that the potential performance improvement will greatly exceed the slowdown; that's why I didn't start with measurements.
How do you handle the need for more memory as you decompress more data, while keeping existing pointers valid?
Blobs of known size are read from a data stream that is decompressed on the fly (lines 278 to 283 in a22773c; see libzim/src/decodeddatastream.h, lines 46 to 78 in a22773c) into a buffer that is created for it by its superclass (lines 30 to 36 in a22773c).
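As a rough illustration of that pattern (simplified names and signatures, not the actual libzim code): the base stream class allocates the destination buffer for a blob of known size, and a concrete subclass fills it, e.g. by inflating just enough compressed input.

#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Sketch only: a sequential data stream whose readBlob() allocates the
// destination buffer and lets the concrete subclass fill it in readImpl().
class DataStreamSketch {
public:
  using BlobSketch = std::pair<std::shared_ptr<const char>, size_t>;

  virtual ~DataStreamSketch() = default;

  // The superclass creates the buffer for the blob of known size...
  BlobSketch readBlob(size_t size) {
    std::shared_ptr<char> data(new char[size], std::default_delete<char[]>());
    readImpl(data.get(), size);        // ...and the subclass fills it.
    return {data, size};
  }

private:
  virtual void readImpl(char* buf, size_t nbytes) = 0;
};

// Trivial subclass serving bytes from memory; a decompressing subclass would
// instead pull compressed input and inflate exactly `nbytes` bytes here.
class MemoryStreamSketch : public DataStreamSketch {
public:
  explicit MemoryStreamSketch(const char* src) : src_(src) {}
private:
  void readImpl(char* buf, size_t nbytes) override {
    std::memcpy(buf, src_ + pos_, nbytes);   // bounds checking omitted
    pos_ += nbytes;
  }
  const char* src_;
  size_t pos_ = 0;
};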
Force-pushed from 7d21ff4 to fdc4d76
Force-pushed from e351c3f to c610baa
CompressedCluster got its own collection of blobs. The latter is filled by simply breaking down the fully uncompressed data with the help of `ReaderDataStreamWrapper`. However, the `IDataStream` object used at this point is still a `ReaderDataStreamWrapper` reading from the fully uncompressed data.
Cluster is now an abstract base class for `NonCompressedCluster` and `CompressedCluster`.
From now on, compressed clusters are decompressed using the streaming decompression (`DecodedDataStream`) API. However, a cluster is still fully decompressed (i.e. all blobs are extracted and stored inside the cluster).
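As a sketch of the approach these commits describe (hypothetical names, not the PR's actual classes): the compressed cluster makes one sequential pass over the decompression stream and stores every blob it extracts.

#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for IDataStream::Blob and DecodedDataStream.
struct BlobData {
  std::shared_ptr<const char> data;
  size_t size;
};

struct SequentialSource {
  virtual BlobData readBlob(size_t size) = 0;  // next `size` bytes of the stream
  virtual ~SequentialSource() = default;
};

// Sketch of a compressed cluster that is fully decompressed on construction:
// blobs are laid out back to back, so one sequential pass extracts them all.
class CompressedClusterSketch {
public:
  CompressedClusterSketch(SequentialSource& src, const std::vector<size_t>& blobSizes) {
    blobs_.reserve(blobSizes.size());
    for (size_t s : blobSizes)
      blobs_.push_back(src.readBlob(s));
  }
  const BlobData& getBlob(size_t n) const { return blobs_[n]; }
private:
  std::vector<BlobData> blobs_;
};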
Force-pushed from c610baa to 1c2a7b9
This is an interesting PR. There are a few things I like and some others I don't :)
I mainly disagree with the new code architecture this change introduces.
Let me first explain the global architecture of the libzim classes.
- `Buffer` represents a memory region. There are several kinds of buffers, as the memory region can be handled differently (not handled, simply deleted at destruction, or mmap backed).
- `Reader` is an interface to read data. It can read data directly from a (potentially split) file (`FileReader`) or from a memory zone (`BufferReader`).
- Other classes (`Dirent`, `FileImpl` and `Cluster`) use a reader to get data. They are about interpreting the data they read (the logic), not about how the data is stored/read (the implementation). [1]
- `FileImpl` is the smart part that creates the appropriate `Reader`/`Buffer` depending on the context.
- `Blob` is the end of the chain. It is the (size-aware) data returned to the user. It represents a memory zone (as a `Buffer` does). Technically it could be replaced by a `Buffer`, but a `Buffer` is something internal to the libzim and `Blob` is public (and `Blob` was here long before `Buffer`, and I didn't want to break the API when I introduced `Buffer`).
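As a condensed sketch of these roles (illustrative only; the real libzim classes have richer interfaces):

#include <cstddef>
#include <memory>

// Buffer: a memory region; concrete subclasses differ only in how the
// region is owned (plain allocation, non-owning view, mmap, ...).
class BufferSketch {
public:
  virtual ~BufferSketch() = default;
  virtual const char* data() const = 0;
  virtual size_t size() const = 0;
};

// Reader: the interface used to get data, whether it comes from a
// (possibly split) file or from a memory zone.
class ReaderSketch {
public:
  virtual ~ReaderSketch() = default;
  virtual void read(char* dest, size_t offset, size_t size) const = 0;
  virtual std::shared_ptr<const BufferSketch> get_buffer(size_t offset, size_t size) const = 0;
};

// Dirent, FileImpl and Cluster would only interpret what a reader hands
// them; Blob is the size-aware view finally returned to the user.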
In our case, this architecture allows:
- Splitting the reading of the data (`Reader`/`Buffer`) from its interpretation (`Cluster`/`Dirent`), using an interface (`Reader`) between the two. [1]
- `FileImpl` to create the "best" reader to read some cluster data (if not compressed: a `FileReader`; if compressed: a `BufferReader` on a decompressed memory region).
- `Cluster` not to have to care about how data is stored. It only cares about reading the data and interpreting it.

What is missing is a way to have lazy decompression. This is mostly because the decompression algorithm is not associated with a reader; it is just a way to create a buffer. So `FileImpl` has to decompress the whole cluster data to pass a `Reader` to `Cluster`.
What you do in this PR complicates the global architecture a lot:
- You split the `Cluster` implementation in two and use different algorithms and storage depending on how the data is read. You break the separation between two different problems.
- You create a new kind of intermediate class (`IDataStream`, `IDataStream::Blob`).
- You add a somewhat complex relation between `zim::Blob`, `IDataStream::Blob` and `Buffer`.
I don't think we need a dataStream at all. We "simply" need to create a `DecompressReader` that lazily decompresses the data just when we need (read) it.
The cluster would have to be changed a bit. It would have to store the buffers of the already-read blobs and make sure that all buffers below `idx` have been read when getting blob `idx`.
Here is pseudo-code for the `DecompressReader` and the changes to be made to `Cluster`.
class DecompressReader : public Reader
{
  void read(char* dest, offset_t offset, zsize_t size) const {
    // The following rewind handling should not normally be needed, as we are
    // smart on the cluster side, but we need it to be a fully functional reader.
    if (last_read_offset > offset) {
      last_read_offset = 0;
    }
    // Decompress from last_read_offset to offset (and drop the decompressed memory).
    decompress(offset - last_read_offset, nullptr);
    // Here is the useful decompression.
    decompress(size, dest);
    last_read_offset = offset + size;
  }

  void decompress(zsize_t size, char* dest) {
    // Reuse the code of the DecodedDataStream.
  }

  std::shared_ptr<const Buffer> get_buffer(offset_t offset, zsize_t size) const {
    auto ret_buffer = std::make_shared<MemoryBuffer>(size);
    read(ret_buffer->buf(), offset, size);
    return ret_buffer;
  }

  Buffer /*or Reader*/ compresseddata;
  compression_stream stream;
  offset_t last_read_offset;
};
class Cluster // updated code
{
  Blob getBlob(blob_index_t n) const
  {
    auto nb_read_buffers = already_read_buffers.size();
    for (auto i = nb_read_buffers; i <= n; i++) {
      // If the cluster is not compressed, reader is a `BufferReader` and the created
      // buffer is a sub-buffer without memory copy.
      // If the cluster is compressed, reader is a `DecompressReader` and the created
      // buffer is a "just in time" allocated and decompressed buffer.
      auto buffer = reader->get_buffer(startOffset + offsets[blob_index_type(i)], getBlobSize(i));
      already_read_buffers.push_back(buffer);
    }
    return Blob(already_read_buffers[n]);
  }

  std::vector<std::shared_ptr<const Buffer>> already_read_buffers;
};
[1] Yes, `Dirent` uses a `Buffer` and not a `Reader`. But this is because I haven't done things correctly. We should change this to make `Dirent` use a `Reader`, potentially creating a buffered reader (not to be confused with `BufferReader`) to avoid too many syscalls when reading each dirent's fields. Directly using a `Buffer` is more a wrong implementation by a lazy developer (me) than a design choice.
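A possible shape for that buffered reader (hypothetical, just to illustrate the idea): prefetch one chunk with a single bulk read and then serve the many small dirent-field reads from memory.

#include <cstddef>
#include <cstring>
#include <vector>

// Minimal stand-in for the zim::Reader interface used below.
struct ReaderIface {
  virtual void read(char* dest, size_t offset, size_t size) const = 0;
  virtual ~ReaderIface() = default;
};

// Hypothetical buffered reader: one bulk read up front, then every small
// field access is a memcpy instead of a separate syscall.
class BufferedReaderSketch {
public:
  BufferedReaderSketch(const ReaderIface& source, size_t offset, size_t size)
    : buffer_(size), base_offset_(offset) {
    source.read(buffer_.data(), offset, size);   // single bulk read
  }
  void read(char* dest, size_t offset, size_t size) const {
    std::memcpy(dest, buffer_.data() + (offset - base_offset_), size);  // no bounds checks
  }
private:
  std::vector<char> buffer_;
  size_t base_offset_;
};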
src/idatastream.h (outdated)
class Blob
{
private: // types
  typedef std::shared_ptr<const char> DataPtr;

public: // functions
  Blob(const DataPtr& data, size_t size) : data_(data), size_(size) {}

  const char* data() const { return data_.get(); }
  size_t size() const { return size_; }

private: // data
  DataPtr data_;
  size_t size_;
};
Why not use `zim::Blob`?
My intention is to promote `IDataStream::Blob` to `zim::Blob`, so eventually we will use `zim::Blob` here 😉

This version of a blob implementation has several advantages over the current version of `zim::Blob`:
- It is a little more lightweight
- and it can still replace, after minor enhancements, both `zim::Blob` and `zim::Buffer`
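To illustrate the point (a sketch of the intended semantics, not the final class): a shared_ptr-based blob both exposes the bytes and keeps their backing memory alive, and a sub-blob can share that ownership via the aliasing constructor, which is what would let it stand in for both `zim::Blob` and `zim::Buffer`.

#include <cstddef>
#include <memory>
#include <utility>

// Sketch of a lightweight, ownership-carrying blob.
class BlobSketch {
public:
  BlobSketch(std::shared_ptr<const char> data, size_t size)
    : data_(std::move(data)), size_(size) {}

  const char* data() const { return data_.get(); }
  size_t size() const { return size_; }

  // A sub-blob aliases the same underlying memory, so the backing buffer
  // stays alive as long as any blob referring to it exists.
  BlobSketch sub(size_t offset, size_t len) const {
    return BlobSketch(std::shared_ptr<const char>(data_, data_.get() + offset), len);
  }

private:
  std::shared_ptr<const char> data_;
  size_t size_;
};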
const char* data_;
size_t size_;
Why not use a `zim::Buffer` associated with an offset?
I have wicked plans of getting rid of `zim::Buffer`.
Please stop the wicked plans :) Let's talk together first.
namespace
{

class IDSBlobBuffer : public Buffer
A buffer is a memory zone. It somehow makes no sense to back it by a data stream.
I guess there is some misunderstanding here. This is a temporary helper class used to wrap an `IDataStream::Blob` in a `zim::Blob`.

This is probably the ugliest part of the new code; it will be eliminated when `IDataStream::Blob` takes over `zim::Blob`.

Blob idsBlob2zimBlob(const IDataStream::Blob& blob, size_t offset, size_t size)
{
  return Blob(std::make_shared<IDSBlobBuffer>(blob, offset, size));
When the user asks for a blob, I think it is time to stop being lazy and load the data.
The data is already loaded at this point (and blob points to it).
namespace zim
{

class BufDataStream : public IDataStream
This class seems to never be used at all (except in tests).
You are right. I thought it was used in an intermediate state, but it only enables unit-testing `DecodedDataStream`.
The fact that

Buffer makeMemoryBuffer(size_t size)
{
  return Buffer(std::shared_ptr<const char>(new char[size], std::default_delete<char[]>()), size);
}

Buffer makeMemoryViewBuffer(const char* start, size_t size)
{
  return Buffer(std::shared_ptr<const char>(start, NoDelete()), size);
}

Buffer makeMmapBuffer(const char* start, size_t size)
{
  // pseudocode
  Mmapping mmapping(start, size);
  return Buffer(std::shared_ptr<const char>(start, mmapping.getUnmmapper()), size);
}

That's why I think that
You are right that the new code adds some complexity. A far-reaching intent is to arrive at a simpler architecture, as outlined in the previous comment (as well as noted in #402 (comment)).
Yes, that will work. But it achieves the result by using a more complex tool (the random-access `Reader`).
On one hand you write:

On the other hand you propose:

which is sub-optimal for non-compressed clusters. The distinction between non-compressed and compressed clusters (if we strive for the most optimal solution in each case) is that between random and sequential access, which has no solution other than treating them differently.
Added a unit-test covering the scenario where the loop body in `DecodedDataStream::readImpl()` has to be executed more than once.
Added benchmarking data in the description of this PR.
Having something to read data, possibly served from a remote resource, is the job of
Please talk with me before making plans for libzim and implementing things.
I'm not sure a sequential reader is simpler. But anyway, we would end up with two tools instead of one, and the global situation would definitely not be simpler.
How? The only use case we have is the decompression stream.
How is it sub-optimal? Uncompressed clusters still have no memory copy and no big data allocation (no more than before). The only things that are created/copied are a few shared_ptrs. And this is a price I totally accept for a simple design.
Another important point is that this must be thread-safe.
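One possible way to address that (a sketch under the assumption that blobs are produced strictly in order from a single decompression stream; not the PR's actual code) is to serialize the lazy step behind a mutex:

#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Sketch: concurrent getBlob() calls must not interleave their use of the
// single sequential decompression stream, so the "decompress up to blob n"
// step is guarded by a mutex.
class LazyClusterSketch {
public:
  std::shared_ptr<const char> getBlob(size_t n) {
    std::lock_guard<std::mutex> lock(mutex_);
    while (blobs_.size() <= n)
      blobs_.push_back(decompressNextBlob());  // strictly sequential
    return blobs_[n];
  }

private:
  std::shared_ptr<const char> decompressNextBlob();  // pulls the next blob from the stream
  std::mutex mutex_;
  std::vector<std::shared_ptr<const char>> blobs_;
};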
Superseded by #421
Fixes #78
Fixes #394
Enables #395 for compressed clusters
Added `IDataStream` interface for sequential reading of data and its 3 implementations:
- `BufDataStream`: reading data from a buffer
- `DecodedDataStream`: for on-the-fly decompression of data coming from another data stream
- `ReaderDataStreamWrapper`: a wrapper around `zim::Reader` presenting it as `IDataStream`
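As an illustration of the wrapper idea (simplified, assumed signatures rather than the PR's exact ones): presenting a random-access reader as a sequential stream mostly amounts to tracking a current offset.

#include <cstddef>

// Minimal stand-ins; the real zim::Reader and IDataStream are richer.
struct ReaderLike {
  virtual void read(char* dest, size_t offset, size_t size) const = 0;
  virtual ~ReaderLike() = default;
};

struct DataStreamLike {
  virtual void readData(char* dest, size_t size) = 0;  // sequential read
  virtual ~DataStreamLike() = default;
};

// Wrapper presenting a random-access reader as a sequential data stream.
class ReaderStreamWrapperSketch : public DataStreamLike {
public:
  explicit ReaderStreamWrapperSketch(const ReaderLike& reader, size_t offset = 0)
    : reader_(reader), offset_(offset) {}

  void readData(char* dest, size_t size) override {
    reader_.read(dest, offset_, size);  // read at the current position...
    offset_ += size;                    // ...then advance it
  }

private:
  const ReaderLike& reader_;
  size_t offset_;
};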
Converted `Cluster` into an abstract base class with two implementations:
- `NonCompressedCluster`
- `CompressedCluster`

Implemented partial/incremental decompression of compressed clusters using `DecodedDataStream`.

It is better to review this PR commit-by-commit.
Benchmarking data:
These changes are most beneficial to `kiwix-serve`. The benchmark was performed using the `benchmark_kiwix_serve` script as follows: that script starts a kiwix server and fetches from it about 1000 articles (more accurately, every 509'th article, which amounts to 1002 articles) using `wget --recursive -l 1`.
(It was verified that the output directories created by wget from both runs are identical, with the exception of `/random`.)
`zimcheck -A` exercises a flow that should not benefit from this optimization. In fact, there is some (about 3%) slowdown: