How can dereferencing the first character take longer when the string is longer?

stabbles | karma 2467 | avg karma 3.87 · 2018-12-21 11:08:03+00:00

In C++ with short string optimization short strings are on the stack while longer strings are on the heap. That's the difference between a cache hit and a cache miss.

foota | karma 6719 | avg karma 2.15 · 2018-12-21 11:50:43+00:00

The article is about C, not C++.

thisismyaccoun7 | karma 110 | avg karma 3.44 · 2018-12-21 14:43:57+00:00

Which is why s/he specified that the post was about c++ and not c in literally the first two words of the post :)

paulmooreparks | karma 740 | avg karma 3.12 · 2018-12-22 06:02:59

The article is about cache effects on performance, independent of language. The example code just happened to be in C.

sigi45 | karma 818 | avg karma 1.25 · 2018-12-21 12:05:14+00:00

How does this make a difference?

The articel states that it becomes slower when two strings don't fit together in one cache line anymore. That is independend on using heap or stack.

reply

Sharlin | karma 15457 | avg karma 3.55 · 2018-12-21 12:52:41

A stack variable is almost always already in the cache.

jonhohle | karma 5691 | avg karma 3.05 · 2018-12-21 13:20:12+00:00

I’m assuming the OP meant that there is a higher likelihood of SSO strings (24 bytes)§ being in the same cache line (64 bytes)§ when they are stored in the same stack frame.

Two small strings allocated on the heap at different times have a lower likelihood of being in the same cache line. Even within consecutive mallocs the addresses might not fall within the same line. Unless the memory is allocated exlipicitly for locality, there’s a higher likelihood (>=0) that memory fragmentation causes the allocations to be split on different lines.

Ultimately, this can be controlled on either the heap or stack, SSO seems to be optimizing for the stack case, which probably works pretty well for most cases (especially for oblivious developers).

§ on recent x86-64 processors

reply

sigi45 | karma 818 | avg karma 1.25 · 2018-12-21 22:09:12+00:00

But I don't create a string on the stack first.

First allocation on the heap than probably a copy onto the stack.

reply

CyberDildonics | karma 3006 | avg karma 0.57 · 2018-12-21 16:21:32

Two string on the stack are far more likely to be cached at a low cache level and/or prefetched (but probably already cached).

A noticeable difference is much more likely to be because a string is too big and moved to the heap, which is in a different part of memory and thus not cached. Also there is the translation look aside buffer which caches the virtual memory to real memory address lookup. This also needs to be done on cold pages of memory and be expensive since it requires looking up in to the virtual memory table.

reply

bostonpete | karma 2330 | avg karma 2.69 · 2018-12-21 13:25:14+00:00

Huh? I guess you mean short strings are stored in the string object rather than separately allocated. That memory will only be on the stack if the string itself is...

austincheney | karma 6940 | avg karma 1.65 · 2018-12-21 12:12:30+00:00

I suspect that in order to access any character index of a given string you must access the entire string and then the specified index. The longer a string the more indexes there are from which to access. I suspect the speed differences between short and long strings is negligible and possibly not something most tools can detect in isolation. There are second and third order consequences that compound over other operations when executed in a loop though.

scarface74 | karma 25313 | avg karma 1.25 · 2018-12-21 12:46:45

When you access an index of an array via a[x], it should only access the memory of:

a + sizeof({type of a}) * x.

The compiler has no concept of a “string” in this context. A string is just a null terminated character array.

reply

dpark | karma 10696 | avg karma 2.45 · 2018-12-21 13:22:37+00:00

> I suspect that in order to access any character index of a given string you must access the entire string and then the specified index. The longer a string the more indexes there are from which to access.

Are you saying that you believe accessing a single character (index) requires reading the entire string? This is incorrect. In C, you might have to read the first N characters in order to read the N+1th character, but that’s a bounds-checking cost due to null-terminated strings(and is avoided if the length is known). And in any case, you need not access anything past the character you wish to read.

The article answers the question. It’s a well-known locality effect.

reply

austincheney | karma 6940 | avg karma 1.65 · 2018-12-21 16:11:37+00:00

The locality effect only explains access via cache versus memory. Just a paragraph later the article mentions this:

> Reading the first character from the string adds another memory access, and the characteristics of that memory access vary depending on the length of the string.

Provided access were always to memory, such that strings in an array of strings exceed cache size, it would not be a locality issue and string length would still result in a performance variance.

reply

dpark | karma 10696 | avg karma 2.45 · 2018-12-21 17:28:36

I'm confused by your response. You quote the article, but you seem to disagree with it. The author explicitly states that this is a data locality effect. The access characteristics vary because of cache effects, as the article explicitly states: "When the strings are short...they are more likely to occupy the same cache line.... As the strings get longer...fewer strings will fit inside a cache line. Eventually...you don't gain any significant benefit from locality."

> Provided access were always to memory, such that strings in an array of strings exceed cache size, it would not be a locality issue and string length would still result in a performance variance

If you always miss the cache, then performance will no longer vary with string length.

reply

austincheney | karma 6940 | avg karma 1.65 · 2018-12-21 18:04:04+00:00

If the strings are short and everything always lived in the cache I don't think the author would have posted this in the first place. If the strings are always long and never fit in the cache the performance cannot be related to cache or anything related to cache vs memory access. In this case I don't think the author would have posted about variable performance when accessing the strings from memory if it were something they had not observed.

Locality only applies in a third middle state when some strings are short enough to fit in the cache and when others are not. In this case the degrading performance is predictable to the number of memory look ups. If the string is in cache there are no memory look ups. If it is in memory there are two look ups: one of the string and a second for the index upon that string. Finally the article mentions the second look up varies according to the length of the string, which suggests different performances for strings of different length even if both reside in memory.

Its not that I disagree with the article, but rather the finality of assumptions drawn from the article. Logically you cannot access a specified index on a string without first accessing the string and then navigating to the given index. There is instruction time in all of that and it isn't solely confined to locality.

reply

dpark | karma 10696 | avg karma 2.45 · 2018-12-21 18:51:23+00:00

> If the strings are short and everything always lived in the cache I don't think the author would have posted this in the first place.

Raymond Chen didn't post this because he had trouble understanding the behavior. He posted this as an exercise for his readers. There is zero chance that Raymond Chen doesn't fully understand this behavior.

Also, the strings being short doesn't mean they're all in the cache. If your strings are 4 chars each but you're storing a billion of them, you won't have them all in cache, but cache effects will still matter.

> Locality only applies in a third middle state when some strings are short enough to fit in the cache and when others are not. In this case the degrading performance is predictable to the number of memory look ups. If the string is in cache there are no memory look ups. If it is in memory there are two look ups: one of the string and a second for the index upon that string. Finally the article mentions the second look up varies according to the length of the string, which suggests different performances for strings of different length even if both reside in memory.

Look at the code again. All the strings have the same length for a given execution. There is no case here where "some strings are short enough to fit in the cache and when others are not".

I think your mental model of cache behavior and why it's relevant here is wrong. The question isn't whether all the strings fit into cache. Nor whether individual strings are "too large" to fit into the cache. The former is uninteresting and the latter is irrelevant because we aren't looking at the entire string, only the first character, which will most certainly fit in the cache.

Modern CPUs will pull in an entire cache line at a time. If you access the first character of a string, you'll pull in the surrounding N bytes as well. If your string is long, you'll only pull in bytes from that string. (You will also pull in the bytes preceding your string if your string doesn't begin on the start of the cache line.) If it's short and other strings are adjacent, you'll pull in bytes from the next string (or strings) as well. This is where the data locality bit matters. If your strings are very short and your layout is sequential, the cache is extremely efficient at minimizing main memory lookups, and your comparison of one string will pull several following strings into the cache "for free".

> Logically you cannot access a specified index on a string without first accessing the string and then navigating to the given index.

I don't understand what your mental model of string access looks like, but accessing a character within a string in C is literally a single pointer addition and a dereference. There is no "navigating" nor is there any meaningful notion of "accessing the string". You access characters individually by offsetting a pointer and dereferencing.

reply

austincheney | karma 6940 | avg karma 1.65 · 2018-12-21 23:53:41+00:00

Excellent. As somebody who does not write C this is the answer I needed.

dpark | karma 10696 | avg karma 2.45 · 2018-12-22 05:11:48+00:00

Glad to help.

They key thing I realized after I wrote my last comment is that you've (probably) been thinking in terms of an object cache. Your comments make a lot of sense if the cache is something like a key/value store that you shove an address and an object into. If you can't fit the string in the cache, it won't be in the cache at all. If you want to access a character in the string, you need to access the string first (pull it from the cache) and then access the character you want.

CPU caches don't map very well to that model. CPU caches make a lot more sense if you think about the system memory as one enormous array of contiguous bytes. The CPU doesn't know anything about objects. It just knows about different-sized chunks of bytes (pages, cache lines) that it can read from the array (system memory). The cache doesn't hold objects (and cannot, because the CPU has no concept of an object). The cache just holds a bit of memory pulled from the array. The benefits of CPU caches come from recency (you'll probably touch the same memory again very soon) and data locality (the CPU will already read in an entire chunk, which is awesome if you're going to touch nearby memory). Your strings are embedded in this enormous array. The "chunking" behavior means that adjacent strings will be loaded together if they're short enough. The shorter, the better, as observed here.

From a quick look, this appears to be one of the better explanations of CPU caches online:

https://courses.cs.washington.edu/courses/cse378/09wi/lectur...

reply

hopler | karma 432 | avg karma 0.53 · 2018-12-21 13:25:05

The page containing the pointer's value is needed, not the string.

rurban | karma 5612 | avg karma 0.84 · 2018-12-21 16:10:17

The page is irrelevant. Important is if the first string is longer than 63, thus forcing the next string to occupy a seperate cache line.

With the usual malloc you have to subtract 8 from 63, as malloc reserves a word at the front. With _independent_comalloc for efficient array storage not.

reply

austincheney | karma 6940 | avg karma 1.65 · 2018-12-21 16:37:19+00:00

I don't why this is downvoted. Reminds of me bitter old burnouts at r/programming on Reddit downvoting anything about JavaScript or WASM not immediately destroying JavaScript. Childish behavior from stupid people with big tears is why I deleted my Reddit account.

userbinator | karma 78987 | avg karma 4.37 · 2018-12-21 12:54:15

This is also why data alignment/structure padding can be slower than packing all the data together. Cache usage is very important in CPUs now, because in a lot of cases the bottleneck is the memory. Padding is effectively wasted memory.

dpark | karma 10696 | avg karma 2.45 · 2018-12-21 13:15:56

Is this true on even x86? On most platforms[1] misaligned access will simply crash your program and you definitely will not execute faster (but you’ll probably terminate faster). On x86 you’ll pay a hefty penalty for the misaligned access and I would expect that it likely costs more than the cache optimization saves (but maybe not).

Of course you can probably optimize your struct packing anyway. Shuffling members around can get a lot of benefit without a bunch of internal padding, though you might still have some trailing padding.

[1] I guess these days the platforms are essentially just x86 and ARM.

reply

dooglius | karma 2466 | avg karma 2.82 · 2018-12-21 13:28:41+00:00

As long as you don't cross a cacheline boundary, I don't believe there is any misalignment penalty on recent x86/ARM processors.

dpark | karma 10696 | avg karma 2.45 · 2018-12-21 13:41:56+00:00

Interesting. I wasn’t aware that ARM allowed this at all.

avian | karma 5135 | avg karma 5.21 · 2018-12-21 13:56:54

It really depends on architecture version and profile. Recent application profiles allow it. On microcontrollers misaligned access is still mostly forbidden as far as I know.

api | karma 31631 | avg karma 2.2 · 2018-12-21 16:38:37+00:00

This inconsistency means you have to assume unaligned access is forbidden in portable code. Fun times.

tinus_hn | karma 7585 | avg karma 1.16 · 2018-12-22 12:10:11

Most code is not portable at the assembler level and compiler hides this kind of detail.

api | karma 31631 | avg karma 2.2 · 2018-12-22 15:49:26+00:00

In C/C++ you can easily do stuff that violates alignment constraints unless you totally avoid casts or packed structs.

tinus_hn | karma 7585 | avg karma 1.16 · 2018-12-23 16:59:03+00:00

If what you are doing has defined behavior the compiler will make it work, although more slowly.

dmitrygr | karma 11233 | avg karma 3.24 · 2018-12-22 08:48:53

Allowed in v7m sometimes with a one cycle penalty for data. Not allowed for code. Not allowed for some instructions ( just like x86 requires alignment for some avx inductions)

CyberDildonics | karma 3006 | avg karma 0.57 · 2018-12-21 16:14:53+00:00

It depends far more on memory access patterns. If memory is accessed linearly (backwards or forwards now) the prefetcher will get it for you before you get to it.

Alignment matter much more when using atomics. X64 can do unaligned atomic access although (I think) there is a penalty for crossing the cache line boundary with atomics. Also there is an issue of 'false sharing' - two threads needing to sync cache lines even though the bytes they are accessing don't overlap.

128 bit (16 byte) atomic operations need to be aligned on 128 bit boundaries or they will crash.

(if anyone sees something inaccurate here feel free to correct me)

reply

stcredzero | karma 37049 | avg karma 2.13 · 2018-12-21 17:42:46

Alignment matter much more when using atomics.

Someone should make a lockless library and call it, "Family." Most people will think it is a reference to the Fast and the Furious, but it would actually be a reference to Dune.

reply

userbinator | karma 78987 | avg karma 4.37 · 2018-12-22 05:20:49

On x86 the effects of misalignment haven't been "hefty" since at least the 486, when they first got cache and everything within a cacheline would be just as fast --- don't forget that x86 never had aligned instruction bytes either.

stcredzero | karma 37049 | avg karma 2.13 · 2018-12-21 17:37:25+00:00

Isn't data alignment/structure padding the sort of thing we ought to be putting into optimizing compilers? People used to allocate registers by hand.

Also, isn't it a pretty strong signal that there's so much talk of cache oblivious algorithms and cache usage? As in an olfactory signal. If we are warping our algorithms and coding so much around it, isn't this a sign that the memory abstraction with the current caching scheme is an overly leaky abstraction?

This, and the abuse of memory as a means of communicating between threads running on different CPU cores are two areas that seem very much like a signal that a paradigm shift of some kind is due on the level of the abstract model of the hardware.

reply

Retra | karma 4460 | avg karma 0.93 · 2018-12-21 18:49:46+00:00

>If we are warping our algorithms and coding so much around it, isn't this a sign that the memory abstraction with the current caching scheme is an overly leaky abstraction?

No, because we don't code our programs in the 'language of performance'. Performance of algorithms is always exterior to our abstractions. There's no way to talk about performance without talking about all the things that 'leak'.

Conversely, if you design your algorithms with no consideration for performance, you will not care what the cache does. If you design your algorithms to maximize something they don't literally encode, then you're escaping the abstraction by definition, and side channels become relevant.

In other words, the abstractions don't leak, they are just not up to the task of modeling optimal performance, so you have to use your human brain to do it on the side.

reply

stcredzero | karma 37049 | avg karma 2.13 · 2018-12-21 20:53:01

Performance of algorithms is always exterior to our abstractions.

From a certain point of view.

In other words, the abstractions don't leak, they are just not up to the task of modeling optimal performance

If the performance becomes pathological, to the point where the app or system is unusable, then from the point of view of the users, the abstraction is leaking.

Let me put it this way. Maybe a flat Random Access Machine isn't the best mental model for certain coding tasks? Maybe there's a way we could have a cleaner abstract model that makes the pathologies visible, yet is a bit more abstract than just thinking about the actual cache?

reply

herf | karma 2888 | avg karma 8.14 · 2018-12-21 15:00:14+00:00

Remember it takes 100ns to get a value from main memory (after you miss all the caches). With random-access data structures the total size tells you the speed.

Here's a test to see how your CPU does: https://github.com/emilk/ram_bench

reply

scarface74 | karma 25313 | avg karma 1.25 · 2018-12-22 00:17:30

Slightly off topic and I feel a little hypocritical for saying this since I’m always railing against the need for every developer to know leetCode and algorithms, but some of the responses make me believe that every developer needs to have at least some understanding of lower level languages like C and/or assembly at some point.