Discussion:
Question from an LMDB user
Tao Chen
2015-11-03 20:23:17 UTC
Hi Sir/Madam,

Recently I've been trying to use LMDB to store and randomly access a
large number of features. Each feature blob is 16 kB.
Before trying LMDB, I simply stacked all the features together into one
huge binary file and used the seek function in C++ to access each
feature. Since the feature size is fixed, I can easily compute the
offset of each feature in the file.
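For reference, here is a minimal sketch of this direct-access approach
(the file name and open mode are placeholders for illustration):

    #include <cstddef>
    #include <fstream>
    #include <vector>

    static const std::size_t kFeatureSize = 16 * 1024;  // 16 kB per feature

    // With equal-sized records, the offset is simply index * record size.
    std::vector<char> read_feature(std::ifstream& file, std::size_t index) {
        std::vector<char> buf(kFeatureSize);
        file.seekg(static_cast<std::streamoff>(index * kFeatureSize));
        file.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        return buf;
    }

    // Usage: std::ifstream in("features.bin", std::ios::binary);
    //        auto feature = read_feature(in, 42);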

Then I tried LMDB. The value is the feature as-is; the key is "1", "2",
"3", and so on. Since 16 kB is exactly 4 x page_size, adding the key and
header, each feature occupies 5 x page_size, so the DB file on disk is
about 1.25 times the size of the previous binary file. This is already a
disadvantage for LMDB, but I still hoped for some efficiency trade-off.
I use the LMDB++ C++ wrapper to access the features.
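For reference, here is a minimal sketch of a single-key lookup, written
against the underlying LMDB C API rather than the wrapper (the
environment path and key are placeholders, and error checks are omitted
for brevity):

    #include <lmdb.h>
    #include <cstdio>
    #include <string>

    int main() {
        MDB_env* env;
        mdb_env_create(&env);
        mdb_env_open(env, "./features.mdb", MDB_RDONLY, 0664);  // path assumed

        MDB_txn* txn;
        mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);
        MDB_dbi dbi;
        mdb_dbi_open(txn, nullptr, 0, &dbi);  // unnamed (default) database

        std::string key = "42";  // arbitrary example key
        MDB_val k{key.size(), const_cast<char*>(key.data())};
        MDB_val v;
        if (mdb_get(txn, dbi, &k, &v) == MDB_SUCCESS)
            std::printf("feature %s: %zu bytes\n", key.c_str(), v.mv_size);

        mdb_txn_abort(txn);  // read-only transactions are aborted, not committed
        mdb_env_close(env);
        return 0;
    }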

Next, I compared the two approaches by accessing the same random 1% of
features out of about 300k features. Before the test, I used vmtouch to
evict both files from the OS page cache. The result is surprising: the
LMDB version is 1.5 times slower than the raw binary file (30 s vs 20 s).

Is this because of the feature size (exactly 4 pages)? Do I
misunderstand how LMDB should be used?
Thank you for your time!

Best Regards,

Tao Chen
Howard Chu
2015-11-04 10:56:12 UTC
Post by Tao Chen
Hi Sir/Madam,
Recently I've been trying to use LMDB to store and randomly access a
large number of features. Each feature blob is 16 kB.
Before trying LMDB, I simply stacked all the features together into one
huge binary file and used the seek function in C++ to access each
feature. Since the feature size is fixed, I can easily compute the
offset of each feature in the file.
Then I tried LMDB. The value is the feature as-is; the key is "1", "2",
"3", and so on. Since 16 kB is exactly 4 x page_size, adding the key and
header, each feature occupies 5 x page_size, so the DB file on disk is
about 1.25 times the size of the previous binary file. This is already a
disadvantage for LMDB, but I still hoped for some efficiency trade-off.
I use the LMDB++ C++ wrapper to access the features.
Next, I compared the two approaches by accessing the same random 1% of
features out of about 300k features. Before the test, I used vmtouch to
evict both files from the OS page cache. The result is surprising: the
LMDB version is 1.5 times slower than the raw binary file (30 s vs 20 s).
Is this because of the feature size (exactly 4 pages)?
That certainly doesn't help, given the 16-byte page header: it pushes
each 16 kB value just past four pages, onto a fifth overflow page. We
expect to remove this page header on overflow pages in LMDB 1.0.
Post by Tao Chen
Do I misunderstand how LMDB should be used?
You are comparing a B+tree, which has O(log N) lookup complexity, to
direct access, which is O(1). The result you got is exactly as expected.

There are only 2 reasons to use a tree structure:
1) you will have frequent inserts/deletes from the data set.
2) your data sizes are variable or unknown.

Your experiment uses a constant array, so reason 1 does not apply. And
all of your records are of identical size, so reason 2 does not apply.

This is basic computer science, nothing special about LMDB.
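To put rough numbers on this, here is a back-of-envelope sketch of
cold-cache page reads per lookup (the B+tree branching factor is an
assumed value; real numbers depend on key size and page layout):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 300000.0;        // number of keys (from the post)
        const double record_pages = 4.0;  // raw file: 16 kB / 4 KB pages
        const double fanout = 100.0;      // assumed B+tree branching factor
        // Tree depth grows as log(n) / log(fanout): about 3 levels here.
        const double depth = std::ceil(std::log(n) / std::log(fanout));

        std::printf("raw file: %.0f pages per lookup\n", record_pages);
        std::printf("LMDB:     %.0f tree pages + 5 overflow pages = %.0f\n",
                    depth, depth + 5.0);
        return 0;
    }

Under these assumptions LMDB touches roughly twice as many cold pages
per lookup, which is broadly consistent with the 1.5x slowdown observed
above.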
Post by Tao Chen
Thank you for your time!
Best Regards,
Tao Chen
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/