
Got it, thank you.

I've been experimenting with a less general segmented, offset-addressed approach to deal with the same issues. It's quite surprising how much semantics can be encoded in 64 bytes. The performance gains are substantial.




I'm not talking about byte-sized data, though. It's bit-sized data that's tough, especially when it crosses byte boundaries.

Could you explain (even just a link if you have one) what you and GP mean by 'bytes vs. meaning'?

I can't really imagine what the latter is, other than perhaps signing byte values within a structure rather than the structure as a whole. If that's it, I don't understand how it's a non-trivial difference, or what you mean by 'byte sequences mapping to a given meaning'.

Not looking to roll my own, just curious!


Thanks! Interesting to have this confirmed. I had a think about how this might be used in practice. Probably not huge wins but every byte counted in those days!

Thank you @eesmith. Comments appreciated, and PRs to the repo as well. ;-) The multi-byte separator is a great catch! I made the wrong assumption that separators are a single byte. Perhaps that's a library limitation if we want to keep the logic simple. Ideas on a fix?

Thanks for the tip. By the way, line 84 introduces some unneeded serialization; you could get some more speed there by fetching the bits in parallel.

I was always a bit surprised that the length-encoding byte didn't also imply an offset by the number of values that could already be encoded with a shorter sequence, which would have eliminated the redundant overlong encodings.
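
Roughly what I mean, as a toy sketch (a made-up 1-or-2-byte format, not any real codec): bias the two-byte sequences by the 128 values a single byte already covers, so every value has exactly one encoding.

    /* Toy 1- or 2-byte length-prefixed encoding. Two-byte sequences are
     * biased by 128 (the number of one-byte values), so there are no
     * redundant "overlong" encodings. Purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    /* Writes 1 or 2 bytes; handles values 0 .. 128 + 0x7FFF. */
    static int put(uint32_t v, uint8_t out[2]) {
        if (v < 0x80) {                       /* 0xxxxxxx */
            out[0] = (uint8_t)v;
            return 1;
        }
        v -= 0x80;                            /* the bias */
        out[0] = 0x80 | (uint8_t)(v >> 8);    /* 1xxxxxxx, high bits */
        out[1] = (uint8_t)(v & 0xFF);
        return 2;
    }

    static uint32_t get(const uint8_t *in, int *used) {
        if ((in[0] & 0x80) == 0) { *used = 1; return in[0]; }
        *used = 2;
        return 0x80 + (((uint32_t)(in[0] & 0x7F) << 8) | in[1]);
    }

    int main(void) {
        uint8_t buf[2];
        int used, n = put(128, buf);          /* smallest two-byte value */
        printf("wrote %d byte(s), read back %u\n", n, get(buf, &used));
        return 0;
    }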

Thanks for the explanation. The byte code analogy made me understand the technology a lot better.

Oh interesting, I hadn't seen that. It looks like it uses the same idea of shuffle-lookups on the first three nibbles. That's a fairly large patch though, and I don't think I've fully grokked it. At the very least, my code differs in that it does less byte-stream shifting, doing the shifts in the scalar domain instead. Their code also needs to deal with quotes, escaping, etc. due to being part of a JSON validator.

Huh, it looks like that only works on 1-byte values? That’s an interesting choice.

Thanks! That's good info. The hardest part was that in OTP 22 the examples given used many of the deprecated erl_* APIs. I was able to update the C examples (see the second link) to use non-deprecated ei_* API calls. Mostly small changes and a bit better buffer management. Though, I don't like the lack of buffer-length checks in even the newer ei_encode_* functions. :/ I added 24 bytes of padding, guessing most single-item encodes are smaller than that, and then check variable-length items for size. Still hard to use safely without a buffer overrun. I'll take a look and see what else may have changed. It's exciting seeing all the continual BEAM improvements!
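
Roughly the shape of that padding approach (a simplified sketch, not my actual code; HEADROOM and the helper name are made up, and it assumes erl_interface's ei headers and library, linked with -lei):

    #include <ei.h>
    #include <stdio.h>

    #define HEADROOM 24   /* guess: upper bound for one fixed-size term */

    /* Encodes {N, Binary} into buf; returns bytes written, or -1 if buf
     * is too small or an ei call fails. Variable-length items (the binary
     * here) are size-checked explicitly; fixed-size terms get HEADROOM. */
    static int encode_pair(char *buf, int buflen,
                           long n, const void *bin, long binlen) {
        int index = 0;
        /* version + tuple header + integer: three "small" terms */
        if (buflen < 3 * HEADROOM + binlen)
            return -1;
        if (ei_encode_version(buf, &index) < 0) return -1;
        if (ei_encode_tuple_header(buf, &index, 2) < 0) return -1;
        if (ei_encode_long(buf, &index, n) < 0) return -1;
        if (ei_encode_binary(buf, &index, bin, binlen) < 0) return -1;
        return index;
    }

    int main(void) {
        char buf[256];
        int n = encode_pair(buf, sizeof buf, 42, "payload", 7);
        printf("encoded %d bytes\n", n);
        return 0;
    }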

Thanks, I was stuck on byte code.

IIRC, Intel actually does do that. They do a 16-way decode, with one decode starting at each byte offset.

It's hard to define semantics. Here, you have potentially copied less than dst->len bytes into dst->data. Is the remaining data (up to dst->len) still valid, is it garbage, should you have adjusted dst->len, etc?
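
One way to pin that down (hypothetical buffer type, not the code being discussed) is to make the contract explicit: shrink dst->len to what was actually copied, so nothing beyond it is ever considered valid.

    #include <stdio.h>
    #include <string.h>

    struct buf {
        unsigned char data[64];
        size_t len;   /* contract: number of VALID bytes in data */
    };

    /* Copies at most sizeof dst->data bytes and records how many are valid;
     * bytes past dst->len are explicitly undefined. */
    static size_t copy_into(struct buf *dst, const void *src, size_t n) {
        size_t m = n < sizeof dst->data ? n : sizeof dst->data;
        memcpy(dst->data, src, m);
        dst->len = m;
        return m;
    }

    int main(void) {
        struct buf b;
        copy_into(&b, "hello", 5);
        printf("%zu valid bytes\n", b.len);
        return 0;
    }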

When does it even make sense to partially mix byte-level representations of some data?

The buffer abstraction is necessary, but far from sufficient to build security-critical code. To do so, you also need to code in semantics somewhere, and buffers know zilch about the meaning of the data.


> There are APIs to fetch byte ranges from objects. Would that work without decompressing the entire object?

I wonder if you could store the objects in chunks that are then individually compressed. Create an index of the byte ranges for those chunks so they can be looked up easily.

For example:

range 0-4000 -> objectA, objectB, objectC

Then just return those 3 objects, decompressed.
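
Something like this sketch (zlib, made-up types, fixed 4000-byte chunks to match the example): compress each chunk on its own, keep an index of which uncompressed byte range each chunk covers, and a range request only decompresses the chunks it overlaps.

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>          /* link with -lz */

    #define CHUNK 4000         /* uncompressed bytes per chunk */

    struct chunk_entry {
        size_t uncomp_start;   /* first uncompressed byte this chunk covers */
        size_t uncomp_len;
        size_t comp_off;       /* where its compressed bytes live in the blob */
        size_t comp_len;
    };

    /* Decompresses only the chunks overlapping [start, start + len). */
    static void fetch_range(const unsigned char *blob,
                            const struct chunk_entry *idx, size_t nchunks,
                            size_t start, size_t len) {
        for (size_t i = 0; i < nchunks; i++) {
            const struct chunk_entry *c = &idx[i];
            if (c->uncomp_start + c->uncomp_len <= start ||
                c->uncomp_start >= start + len)
                continue;                             /* no overlap */
            unsigned char *out = malloc(c->uncomp_len);
            uLongf outlen = c->uncomp_len;
            if (uncompress(out, &outlen, blob + c->comp_off, c->comp_len) == Z_OK)
                printf("decompressed chunk %zu covering [%zu, %zu)\n",
                       i, c->uncomp_start, c->uncomp_start + c->uncomp_len);
            free(out);
        }
    }

    int main(void) {
        static unsigned char data[2 * CHUNK], blob[2 * CHUNK + 1024];
        struct chunk_entry idx[2];
        size_t off = 0;
        for (size_t i = 0; i < sizeof data; i++)
            data[i] = (unsigned char)(i % 251);
        for (size_t i = 0; i < 2; i++) {              /* build blob + index */
            uLongf clen = sizeof blob - off;
            if (compress(blob + off, &clen, data + i * CHUNK, CHUNK) != Z_OK)
                return 1;
            idx[i] = (struct chunk_entry){ i * CHUNK, CHUNK, off, clen };
            off += clen;
        }
        fetch_range(blob, idx, 2, 4100, 100);         /* touches only chunk 1 */
        return 0;
    }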


Is that an example of “remove all occurrences of a specific byte value from an array”? Wouldn’t packet processing require some sort of structural parsing?
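
For reference, this is the scalar form of that operation; the SIMD versions being discussed typically do the same compaction with vector compares and shuffles.

    #include <stdio.h>
    #include <string.h>

    /* In-place removal of every occurrence of byte v; returns the new length. */
    static size_t remove_byte(unsigned char *buf, size_t len, unsigned char v) {
        size_t out = 0;
        for (size_t i = 0; i < len; i++)
            if (buf[i] != v)
                buf[out++] = buf[i];
        return out;
    }

    int main(void) {
        unsigned char s[] = "a,b,,c";
        size_t n = remove_byte(s, strlen((char *)s), ',');
        printf("%.*s\n", (int)n, s);   /* prints "abc" */
        return 0;
    }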

> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.
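
A toy example of why (not the counters in question, just an illustration): counting "\r\n" pairs needs the last byte of the previous chunk, i.e. one byte of carried state or a one-byte overlap between chunks.

    #include <stdio.h>
    #include <string.h>

    /* Counts "\r\n" pairs in one chunk; *carry holds the last byte of the
     * previous chunk so pairs split across a chunk boundary aren't missed. */
    static size_t count_crlf(const unsigned char *chunk, size_t len,
                             unsigned char *carry) {
        size_t n = 0;
        unsigned char prev = *carry;
        for (size_t i = 0; i < len; i++) {
            if (prev == '\r' && chunk[i] == '\n')
                n++;
            prev = chunk[i];
        }
        *carry = prev;
        return n;
    }

    int main(void) {
        const char *data = "a\r\nb\r\nc";
        unsigned char carry = 0;
        size_t total = 0;
        /* split mid-pair on purpose: the first chunk ends with '\r' */
        total += count_crlf((const unsigned char *)data, 2, &carry);
        total += count_crlf((const unsigned char *)data + 2, 5, &carry);
        printf("%zu\n", total);   /* prints 2 */
        return 0;
    }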


Presumably this also only works well if the data is 4-byte aligned.

Making the message IDs equal to the byte offset of the message in the history of all messages for the topic/partition is a neat trick to avoid indexing overhead.
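
A minimal sketch of the trick (made-up record format, not any particular broker's: a 4-byte native-endian length followed by the payload). The ID handed back is just the offset where the record starts, so reading by ID is a seek rather than an index lookup.

    #include <stdint.h>
    #include <stdio.h>

    /* Appends a message; returns its ID (= byte offset where it starts). */
    static long append_msg(FILE *log, const void *payload, uint32_t len) {
        fseek(log, 0, SEEK_END);
        long id = ftell(log);
        fwrite(&len, sizeof len, 1, log);
        fwrite(payload, 1, len, log);
        return id;
    }

    /* Reads the message whose ID is `id` straight from that offset. */
    static uint32_t read_msg(FILE *log, long id, void *out, uint32_t cap) {
        uint32_t len = 0;
        fseek(log, id, SEEK_SET);
        if (fread(&len, sizeof len, 1, log) != 1 || len > cap) return 0;
        return (uint32_t)fread(out, 1, len, log);
    }

    int main(void) {
        FILE *log = tmpfile();
        long a = append_msg(log, "hello", 5);
        long b = append_msg(log, "world!", 6);
        char buf[16];
        uint32_t n = read_msg(log, b, buf, sizeof buf);
        printf("ids %ld and %ld, second payload: %.*s\n", a, b, (int)n, buf);
        return 0;
    }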

"Any software problem can be solved by adding another layer of indirection. Except, of course, the problem of too much indirection." - Steve Bellovin of AT&T Labs


> Now, we all know that it can take a while to find a long sequence of digits in π, so for practical reasons, we should break the files up into smaller chunks that can be more readily found. In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in π.

So basically like a regular filesystem, except now you look up each byte every time you intend to use it?

