
Got it, thank you.

I've been experimenting with a less general segmented, offset-addressed approach to deal with the same issues. It's quite surprising how much semantics can be encoded in 64 bytes. The performance gains are substantial.




I'm not talking about byte-sized data, though. It's bit-sized data that's tough, especially when it crosses byte boundaries.

Could you explain (even just a link if you have one) what you and GP mean by 'bytes vs. meaning'?

I can't really imagine what the latter is, other than perhaps signing byte values within a structure rather than the structure as a whole. If that's it, I don't understand how it's a non-trivial difference, or what you mean by 'byte sequences mapping to a given meaning'.

Not looking to roll my own, just curious!


Thanks! Interesting to have this confirmed. I had a think about how this might be used in practice. Probably not huge wins but every byte counted in those days!

Thank you @eesmith. Comments appreciated, and PRs to the repo as well. ;-) The multi-byte separator is a great catch! I made the wrong assumption that separators are a single byte. Perhaps that's a library limitation if we want to keep the logic simple. Ideas on a fix?

Thanks for the tip. By the way, line 84 introduces some unneeded serialization; you could get some more speed there by fetching the bits in parallel.

I was always a bit surprised that the length-encoding byte didn't also imply an offset by the number of values that could already be encoded with a shorter sequence, which would have eliminated the redundant overlong encodings.
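
Roughly what I mean, as a toy sketch (a made-up 1-or-2-byte format, not any real codec): bias the two-byte sequences by the 128 values a single byte already covers, so every value has exactly one encoding.

    /* Toy 1- or 2-byte length-prefixed encoding. Two-byte sequences are
     * biased by 128 (the number of one-byte values), so there are no
     * redundant "overlong" encodings. Purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    /* Writes 1 or 2 bytes; handles values 0 .. 128 + 0x7FFF. */
    static int put(uint32_t v, uint8_t out[2]) {
        if (v < 0x80) {                       /* 0xxxxxxx */
            out[0] = (uint8_t)v;
            return 1;
        }
        v -= 0x80;                            /* the bias */
        out[0] = 0x80 | (uint8_t)(v >> 8);    /* 1xxxxxxx, high bits */
        out[1] = (uint8_t)(v & 0xFF);
        return 2;
    }

    static uint32_t get(const uint8_t *in, int *used) {
        if ((in[0] & 0x80) == 0) { *used = 1; return in[0]; }
        *used = 2;
        return 0x80 + (((uint32_t)(in[0] & 0x7F) << 8) | in[1]);
    }

    int main(void) {
        uint8_t buf[2];
        int used, n = put(128, buf);          /* smallest two-byte value */
        printf("wrote %d byte(s), read back %u\n", n, get(buf, &used));
        return 0;
    }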

Thanks for the explanation. The byte code analogy made me understand the technology a lot better.

Oh interesting, I hadn't seen that. It looks like it uses the same idea of shuffle-lookups on the first three nibbles. That's a fairly large patch though, and I don't think I've fully grokked it. At the very least, my code differs in that it does less byte-stream shifting, doing the shifts in the scalar domain instead. Their code also needs to deal with quotes, escaping, etc. due to being part of a JSON validator.

Huh, it looks like that only works on 1-byte values? That’s an interesting choice.

Thanks! That's good info. The hardest part was that in OTP 22 the examples given used many of the deprecated erl_* APIs. I was able to update the C examples (see the second link) to use non-deprecated ei_* API calls. Mostly small changes and a bit better buffer management. Though, I don't like the lack of buffer-length checks in even the newer ei_encode_* functions. :/ I added 24 bytes of padding, guessing most single-item encodes are smaller than that, and then check variable-length items for size. Still hard to use safely without a buffer overrun. I'll take a look and see what else may have changed. It's exciting seeing all the continual BEAM improvements!
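
Roughly the shape of that padding approach (a simplified sketch, not my actual code; HEADROOM and the helper name are made up, and it assumes erl_interface's ei headers and library, linked with -lei):

    #include <ei.h>
    #include <stdio.h>

    #define HEADROOM 24   /* guess: upper bound for one fixed-size term */

    /* Encodes {N, Binary} into buf; returns bytes written, or -1 if buf
     * is too small or an ei call fails. Variable-length items (the binary
     * here) are size-checked explicitly; fixed-size terms get HEADROOM. */
    static int encode_pair(char *buf, int buflen,
                           long n, const void *bin, long binlen) {
        int index = 0;
        /* version + tuple header + integer: three "small" terms */
        if (buflen < 3 * HEADROOM + binlen)
            return -1;
        if (ei_encode_version(buf, &index) < 0) return -1;
        if (ei_encode_tuple_header(buf, &index, 2) < 0) return -1;
        if (ei_encode_long(buf, &index, n) < 0) return -1;
        if (ei_encode_binary(buf, &index, bin, binlen) < 0) return -1;
        return index;
    }

    int main(void) {
        char buf[256];
        int n = encode_pair(buf, sizeof buf, 42, "payload", 7);
        printf("encoded %d bytes\n", n);
        return 0;
    }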

Thanks, I was stuck on byte code.

IIRC, Intel actually does do that. They do a 16-way decode, with one decode starting at each byte offset.

It's hard to define semantics. Here, you have potentially copied less than dst->len bytes into dst->data. Is the remaining data (up to dst->len) still valid, is it garbage, should you have adjusted dst->len, etc?
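
One way to pin that down (hypothetical buffer type, not the code being discussed) is to make the contract explicit: shrink dst->len to what was actually copied, so nothing beyond it is ever considered valid.

    #include <stdio.h>
    #include <string.h>

    struct buf {
        unsigned char data[64];
        size_t len;   /* contract: number of VALID bytes in data */
    };

    /* Copies at most sizeof dst->data bytes and records how many are valid;
     * bytes past dst->len are explicitly undefined. */
    static size_t copy_into(struct buf *dst, const void *src, size_t n) {
        size_t m = n < sizeof dst->data ? n : sizeof dst->data;
        memcpy(dst->data, src, m);
        dst->len = m;
        return m;
    }

    int main(void) {
        struct buf b;
        copy_into(&b, "hello", 5);
        printf("%zu valid bytes\n", b.len);
        return 0;
    }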

When does it even make sense to partially mix byte-level representations of some data?

The buffer abstraction is necessary, but far from sufficient to build security-critical code. To do so, you also need to code in semantics somewhere, and buffers know zilch about the meaning of the data.


> There are APIs to fetch byte ranges from objects. Would that work without decompressing the entire object?

I wonder if you could store the objects in chunks that are then individually compressed. Create an index of the byte ranges for those chunks so they can be looked up easily.

For example:

range 0-4000 -> objectA, objectB, objectC

Then just return those 3 objects, decompressed.
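
Something like this sketch (zlib, made-up types, fixed 4000-byte chunks to match the example): compress each chunk on its own, keep an index of which uncompressed byte range each chunk covers, and a range request only decompresses the chunks it overlaps.

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>          /* link with -lz */

    #define CHUNK 4000         /* uncompressed bytes per chunk */

    struct chunk_entry {
        size_t uncomp_start;   /* first uncompressed byte this chunk covers */
        size_t uncomp_len;
        size_t comp_off;       /* where its compressed bytes live in the blob */
        size_t comp_len;
    };

    /* Decompresses only the chunks overlapping [start, start + len). */
    static void fetch_range(const unsigned char *blob,
                            const struct chunk_entry *idx, size_t nchunks,
                            size_t start, size_t len) {
        for (size_t i = 0; i < nchunks; i++) {
            const struct chunk_entry *c = &idx[i];
            if (c->uncomp_start + c->uncomp_len <= start ||
                c->uncomp_start >= start + len)
                continue;                             /* no overlap */
            unsigned char *out = malloc(c->uncomp_len);
            uLongf outlen = c->uncomp_len;
            if (uncompress(out, &outlen, blob + c->comp_off, c->comp_len) == Z_OK)
                printf("decompressed chunk %zu covering [%zu, %zu)\n",
                       i, c->uncomp_start, c->uncomp_start + c->uncomp_len);
            free(out);
        }
    }

    int main(void) {
        static unsigned char data[2 * CHUNK], blob[2 * CHUNK + 1024];
        struct chunk_entry idx[2];
        size_t off = 0;
        for (size_t i = 0; i < sizeof data; i++)
            data[i] = (unsigned char)(i % 251);
        for (size_t i = 0; i < 2; i++) {              /* build blob + index */
            uLongf clen = sizeof blob - off;
            if (compress(blob + off, &clen, data + i * CHUNK, CHUNK) != Z_OK)
                return 1;
            idx[i] = (struct chunk_entry){ i * CHUNK, CHUNK, off, clen };
            off += clen;
        }
        fetch_range(blob, idx, 2, 4100, 100);         /* touches only chunk 1 */
        return 0;
    }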


Is that an example of “remove all occurrences of a specific byte value from an array”? Wouldn’t packet processing require some sort of structural parsing?
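
For reference, this is the scalar form of that operation; the SIMD versions being discussed typically do the same compaction with vector compares and shuffles.

    #include <stdio.h>
    #include <string.h>

    /* In-place removal of every occurrence of byte v; returns the new length. */
    static size_t remove_byte(unsigned char *buf, size_t len, unsigned char v) {
        size_t out = 0;
        for (size_t i = 0; i < len; i++)
            if (buf[i] != v)
                buf[out++] = buf[i];
        return out;
    }

    int main(void) {
        unsigned char s[] = "a,b,,c";
        size_t n = remove_byte(s, strlen((char *)s), ',');
        printf("%.*s\n", (int)n, s);   /* prints "abc" */
        return 0;
    }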

> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.
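
A toy example of why (not the counters in question, just an illustration): counting "\r\n" pairs needs the last byte of the previous chunk, i.e. one byte of carried state or a one-byte overlap between chunks.

    #include <stdio.h>
    #include <string.h>

    /* Counts "\r\n" pairs in one chunk; *carry holds the last byte of the
     * previous chunk so pairs split across a chunk boundary aren't missed. */
    static size_t count_crlf(const unsigned char *chunk, size_t len,
                             unsigned char *carry) {
        size_t n = 0;
        unsigned char prev = *carry;
        for (size_t i = 0; i < len; i++) {
            if (prev == '\r' && chunk[i] == '\n')
                n++;
            prev = chunk[i];
        }
        *carry = prev;
        return n;
    }

    int main(void) {
        const char *data = "a\r\nb\r\nc";
        unsigned char carry = 0;
        size_t total = 0;
        /* split mid-pair on purpose: the first chunk ends with '\r' */
        total += count_crlf((const unsigned char *)data, 2, &carry);
        total += count_crlf((const unsigned char *)data + 2, 5, &carry);
        printf("%zu\n", total);   /* prints 2 */
        return 0;
    }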


Presumably this also only works well if the data is 4-byte aligned.

Making the message IDs equal to the byte offset of the message in the history of all messages for the topic/partition is a neat trick to avoid indexing overhead.
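
A minimal sketch of the trick (made-up record format, not any particular broker's: a 4-byte native-endian length followed by the payload). The ID handed back is just the offset where the record starts, so reading by ID is a seek rather than an index lookup.

    #include <stdint.h>
    #include <stdio.h>

    /* Appends a message; returns its ID (= byte offset where it starts). */
    static long append_msg(FILE *log, const void *payload, uint32_t len) {
        fseek(log, 0, SEEK_END);
        long id = ftell(log);
        fwrite(&len, sizeof len, 1, log);
        fwrite(payload, 1, len, log);
        return id;
    }

    /* Reads the message whose ID is `id` straight from that offset. */
    static uint32_t read_msg(FILE *log, long id, void *out, uint32_t cap) {
        uint32_t len = 0;
        fseek(log, id, SEEK_SET);
        if (fread(&len, sizeof len, 1, log) != 1 || len > cap) return 0;
        return (uint32_t)fread(out, 1, len, log);
    }

    int main(void) {
        FILE *log = tmpfile();
        long a = append_msg(log, "hello", 5);
        long b = append_msg(log, "world!", 6);
        char buf[16];
        uint32_t n = read_msg(log, b, buf, sizeof buf);
        printf("ids %ld and %ld, second payload: %.*s\n", a, b, (int)n, buf);
        return 0;
    }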

"Any software problem can be solved by adding another layer of indirection. Except, of course, the problem of too much indirection." - Steve Bellovin of AT&T Labs


> Now, we all know that it can take a while to find a long sequence of digits in π, so for practical reasons, we should break the files up into smaller chunks that can be more readily found. In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in π.

So basically like a regular filesystem, except now you look up each byte every time you intend to use it?

