Rough context: when the GPU (or the CPU, back in the day) has to draw a texture, it has to sample pixels - maybe one, maybe four with some bilinear filtering between them - and then use the result as the output pixel.
Now several problems:
1. If your texture is sampled at roughly one texel per screen pixel, and all pixels are read linearly ("horizontally"), then you are good with the cache: the cache may be cold at the first pixel's address, but that read pulls in the next pixels just in case you need them - and here is the catch, you need to always get the benefit from that. It's like always using your coupons, your deals, and your employer perks. So the CPU/GPU might read 4, 8, 16 - who really knows (but you should) - bytes in advance, or around that pixel.
2. But then you turn the texture 90 degrees, and suddenly it's much slower. You draw one pixel from the texture, but the next pixel is 256, 512, or more bytes away ("vertically"), and so is the next; the cache line of 4, 8, 16, 32, or 64 bytes you read holds data you never used, and by the time you might need it again, the cache has discarded it. Hence the slowness now - much, much slower!
4. To fix it, you come up with "swizzling" - instead of storing the texture purely scan-line by scan-line, you split the image into blocks - maybe 8x8 or 32x32 tiles - and make sure those tiles are written linearly one after another (a rough sketch follows this list). You can go even further, but the idea is that no matter which direction you read in, you will most likely hit pixels from a cache line you have already read. It's not that simple, and my explanation is poor, but here is someone who explains it better than me: https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
8. But even with swizzling, tiling, or whatever you call that magic that keeps nearby pixels truly together in any direction, it stops working as soon as you have to draw that big texture/image at a much smaller scale.
16. Say a 256x256 texture has to be drawn as 16x16. You say, "I don't need mipmapping, I don't need smaller versions of my image, I can just use my own image" - well, then no matter how you swizzle/tile, you'll be skipping a lot of pixels - hopping from here to there, and losing cache lines.
32. For that reason mipmaps are here to help - but stay with me for one more minute, and see how they also almost fix the shimmering problem of textures with "fences", "grills on a car", a pet's cage, or something like that.
64. And when you hear an artist is ready to put a real Spider-Man logo on a character made out of real polygons, a real grill on the front of the car made out of real polygons, and a really nice-looking barbed-wire fence made out of polygons - then stay away, because those polys will shimmer, no good level of detail can be made for them (it'll pop), and such things are just much more easily done with textures and mipmaps. That combination kind of solves a lot of problems.
128. Read about impostor textures.
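To make the tiling idea above concrete, here is a minimal sketch of converting an (x, y) texel coordinate into a byte offset in a tile-swizzled buffer. This assumes 8x8 tiles of 4-byte RGBA texels laid out row by row; real GPU swizzle patterns are more elaborate (Morton order inside tiles, alignment padding, etc.), but the principle is the same:

```c
#include <stddef.h>
#include <stdint.h>

#define TILE_W 8
#define TILE_H 8
#define BYTES_PER_TEXEL 4 /* RGBA8 */

/* Offset of texel (x, y) in a buffer where each 8x8 tile is stored
 * contiguously and tiles are laid out row by row. Assumes the image
 * width is a multiple of the tile width. */
static size_t tiled_offset(uint32_t x, uint32_t y, uint32_t image_width)
{
    uint32_t tiles_per_row = image_width / TILE_W;
    uint32_t tile_x = x / TILE_W, tile_y = y / TILE_H;   /* which tile           */
    uint32_t in_x   = x % TILE_W, in_y   = y % TILE_H;   /* position inside tile */

    size_t tile_index    = (size_t)tile_y * tiles_per_row + tile_x;
    size_t texel_in_tile = (size_t)in_y * TILE_W + in_x;
    return (tile_index * TILE_W * TILE_H + texel_in_tile) * BYTES_PER_TEXEL;
}
```

With this layout an 8x8 tile occupies one contiguous 256-byte block, so reading the texture vertically (or at any angle) within a tile keeps hitting the same few cache lines instead of striding across whole scan lines.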
> With modern GPUs pushing bitmaps is actually far more efficient than drawing shapes.
No I mean it's inefficient from a product development standpoint. With a flat design, you can make a design system which will scale very easily to multiple screen sizes and aspect ratios. If you have to texture everything there is much more you have to account for.
> I have a blue noise background for my terminals which tremendously helps with eyestrain.
Yeah I mean blue noise is one thing, it's explicitly non-information, but I think in general trying to parse text on top of a pattern is harder than without.
context: I'm building a game engine in Zig called Mach engine; this video goes over my plans for tackling vector graphics in the engine (for text rendering, GUI, 2D characters, etc.)
Modern vector graphics are surprisingly complex, so while none of the ideas here are necessarily groundbreaking, I believe the simpler ground-up approach I'm taking - 'what does my GPU know how to do? build on top of that' - is fairly interesting.
This was also the first full-length talk I've ever given, so go easy on me :) Happy to answer questions.
Slides are here[0] in case you prefer to just flip through them.
> One would assume/hope that specifying "bitmaprenderer" for the context type would give you a regular immediate-mode CPU rasterizer. Is that not the case?
No, that's something else.
> To expand on this...
I almost wrote something like that, but then I considered that I haven't really benchmarked this. Streaming data from CPU onto the GPU is certainly possible and graphics APIs do have hints for such usage. You also don't need to convert to a texture to get arbitrary data on the screen, a trivial shader can do that for you.
If your data/transformations naturally live in RAM/CPU, that may well be the most efficient thing to do.
> Why bother with rasterization when you can write a ray tracer in a page of code and get better results?
You do realize that ray tracing is ultimately a problem whose performance is determined by the external memory system?
Unless you’re thinking about each mini CPU rendering a bunch of reflective balls where the full scene can be stored locally in each CPU.
Because once you don’t, you’ll have to find ways to cover the access latency to the shared memory pool, and before you know it your super simple CPU will look suspiciously like the shader core of today’s GPUs.
Your other examples have similar limitations.
The truth is that there are not many problems that can efficiently be mapped to an architecture with tons of small CPUs, some local RAM, and nothing else.
> In the basic model of a computer architecture, the screen is abstracted as a pixel array in memory - set those bits and the screen will render the pixels. The rest is hand waved as hardware.
It was. I still remember the days.
It was nice to be able to put pixels on the screen by poking at a 2D array directly. It simplified so much. Unfortunately, it turned out that our CPUs aren't as fast as we'd like at this task, with said array having 10^5, and then 10^6, cells - and the architecture evolved in a way that exposes a complex processing API for high-level operations, where the good ol' PutPixel() is one of the most expensive ones.
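For reference, that old model really was about this simple; a toy sketch assuming a plain 32-bit framebuffer sitting in ordinary memory (not any real display API):

```c
#include <stdint.h>
#include <stdlib.h>

#define WIDTH  320
#define HEIGHT 200

/* A toy framebuffer: one 32-bit pixel per cell, row after row. */
static uint32_t *framebuffer;

static void put_pixel(int x, int y, uint32_t rgba)
{
    framebuffer[y * WIDTH + x] = rgba; /* this is all PutPixel() ever was */
}

int main(void)
{
    framebuffer = calloc(WIDTH * HEIGHT, sizeof *framebuffer);
    for (int x = 0; x < WIDTH; x++)
        put_pixel(x, HEIGHT / 2, 0xFFFFFFFFu); /* a white horizontal line */
    free(framebuffer);
    return 0;
}
```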
It's definitely a win for complex 3D games / applications, but if all you want is to draw some pixels on the screen, and think in pixels, it's not so easy these days.
You divide the image into quarters and store each quarter as a contiguous block of memory. Do this recursively.
Normally we'd index into the pixel data using pixel[x,y]. You can get Z-ordering by using pixel[interleave(x,y)] where the function interleave(x,y) interleaves the bits of the two parameters.
This works fantastically well when the image is a square power of two, and gets terrible when it's one pixel high and really wide. I think using square tiles where each tile is internally Z-ordered is probably a useful combination.
For my ray tracer I use a single counter to scan all the pixels in an image. I feed the counter into a "deinterleave" function to split it into an X and Y coordinate before shooting a ray. That way the image is rendered in Z-order. That means better cache behavior from ray to ray and resulted in a 7-9 percent speedup from just this one thing.
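A minimal sketch of such an interleave/deinterleave pair (a 16-bit-per-axis Morton code; not taken from any particular renderer):

```c
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of v so there is a zero bit between each bit. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Inverse of spread_bits: compact every other bit back together. */
static uint32_t compact_bits(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

/* pixel[interleave(x, y)] gives Z-order (Morton) addressing. */
static uint32_t interleave(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

/* Split a single counter back into (x, y), as in the ray tracer above. */
static void deinterleave(uint32_t code, uint32_t *x, uint32_t *y)
{
    *x = compact_bits(code);
    *y = compact_bits(code >> 1);
}

int main(void)
{
    /* Counters 0..7 visit (0,0) (1,0) (0,1) (1,1) (2,0) (3,0) (2,1) (3,1). */
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t x, y;
        deinterleave(i, &x, &y);
        printf("counter %u -> (%u, %u)\n", i, x, y);
    }
    return 0;
}
```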
Once you have data coherence, swapping is not a big deal either in applications where you're zoomed in.
> Are fonts always rendered 'completely' these days? I thought that the fonts would be rendered once to a cache of bitmaps/textures, and then those bitmaps can be copied to the screen/buffer pretty much instantaneously.
They are. But (a) non-Latin languages often miss in the cache; (b) subpixel positioning makes cache misses happen more often; (c) sometimes people animate font size, negating the optimization; (d) we care about initial load time.
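To illustrate point (b): the same glyph at the same size rasterizes differently depending on where it lands within a pixel, so the cache key has to include the subpixel offset. A sketch, assuming the offset is quantized to quarter pixels (the quantization scheme is an assumption for illustration, not any specific library's behavior):

```c
#include <stdint.h>

/* Hypothetical glyph-cache key. Quantizing the subpixel offset (here to
 * 1/4 pixel) bounds how many cached variants each glyph can have. */
typedef struct {
    uint32_t glyph_id;     /* glyph index in the font               */
    uint16_t pixel_size;   /* rasterization size in pixels          */
    uint8_t  subpixel_x;   /* 0..3 -> offsets 0, 0.25, 0.5, 0.75 px */
} GlyphKey;

static GlyphKey make_key(uint32_t glyph_id, uint16_t pixel_size, float pen_x)
{
    GlyphKey k;
    k.glyph_id   = glyph_id;
    k.pixel_size = pixel_size;
    float frac   = pen_x - (float)(int)pen_x;   /* fractional pen position  */
    k.subpixel_x = (uint8_t)(frac * 4.0f) & 3;  /* quantize to four buckets */
    return k;
}
```

With four subpixel buckets, a run of text can easily want several variants of the same glyph in one frame, which is why subpixel positioning makes cache misses more common.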
> Since a lot of the work is stateful and difficult to parallelize, doing it on CPU will probably be faster, that way you only pay the latency to jump to the GPU once.
You can still easily cache the glyphs after processing, especially if you don’t use subpixel AA. There isn’t that much state to a scrollback buffer once glyph processing is done.
I don’t get the resistance to this type of rendering when at this point there are at least three major monospace glyph rendering libraries implemented for the GPU, and I bet there are dozens I don’t know about.
> If you've got 1000 glyphs at a specific visual size to pre-cache into alpha-mask textures;
How often does that happen? There are definitely languages where that is a plausible scenario (eg, Chinese), but for the majority of written languages you have well under 100 glyphs of commonality for any given font style.
And then, as you noted, you cache these to an alpha texture. So all of those 1000 glyphs would need to show up in the same frame for this to even matter.
> Especially on a modern low-power system (e.g. a cheap phone), where you might only have 2-4 slow CPU cores, but still have a bounty of (equally slow) GPU cores sitting there doing mostly nothing?
But the GPU isn't doing nothing. It's already doing all the things it's actually good at like texturing from that alpha texture glyph cache to the hundreds of quads across the screen, filling solid colors, and blitting images.
Rather, typically it's the CPU that is consistently under-utilized. Low end phones still tend to have 6 cores (even up to 10 cores), and apps are still generally bad at utilizing them. You could throw an entire CPU core at doing nothing but font rendering and you probably wouldn't even miss it.
The places where GPU rendering of fonts becomes interesting are when glyphs get huge, or things like smoothly animating across font sizes (especially with variable-width fonts). High-end hero features, basically. For the simple task of text as used on, e.g., this site? Simple CPU-rendered glyphs in an alpha texture are easily implemented and plenty fast.
> I find this to be the most elegant way to handle dithering
Yes, it's so simple that it can be applied in a single pass on a single pixel buffer. In convolution kernel terms, it only samples half of a Moore neighbourhood, and that half can consist of pixels not yet processed in the same buffer when moving through them in order.
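A minimal sketch of that single-pass, single-buffer idea, using the classic Floyd-Steinberg weights (an assumption; the comment above may have a different kernel in mind) on an 8-bit grayscale image. The quantization error is only pushed to the right/below half of the neighbourhood, i.e. pixels not yet visited in scan order:

```c
#include <stdint.h>

static uint8_t clamp_add(int pixel, int delta)
{
    int v = pixel + delta;
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* In-place 1-bit dithering of an 8-bit grayscale image. One pass over one
 * buffer is enough because error is only distributed forward, to pixels
 * that have not been processed yet. */
static void dither_in_place(uint8_t *img, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int old = img[y * width + x];
            int q   = old < 128 ? 0 : 255;   /* quantize to black/white */
            int err = old - q;
            img[y * width + x] = (uint8_t)q;

            /* Floyd-Steinberg weights: 7/16 right, 3/16 down-left,
             * 5/16 down, 1/16 down-right. */
            if (x + 1 < width)
                img[y * width + x + 1] = clamp_add(img[y * width + x + 1], err * 7 / 16);
            if (y + 1 < height) {
                if (x > 0)
                    img[(y + 1) * width + x - 1] = clamp_add(img[(y + 1) * width + x - 1], err * 3 / 16);
                img[(y + 1) * width + x] = clamp_add(img[(y + 1) * width + x], err * 5 / 16);
                if (x + 1 < width)
                    img[(y + 1) * width + x + 1] = clamp_add(img[(y + 1) * width + x + 1], err * 1 / 16);
            }
        }
    }
}
```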
> I've been looking to build an engine/game that approximates this art style: https://obradinn.com
Killedbyapixel took the above dweet as inspiration for the style of some of his proc-gen art, although I haven't dug into the how yet. I suppose deeper game/object-awareness integration could produce better results than merely piping the output of the renderer into the dither algorithm; perhaps the rendering could even be optimized by targeting dithering specifically.
> How exactly do you know what resolution textures are or how many polygons are rendered.
A couple comments ago you were complaining at me for not being specific about why the rendering was bad, and then when someone gives you specific examples you lash out with "well how do you know that's why it's bad!?". We know because we have eyes.
> The copying process was never that much of a big deal
I don't know about that? Texture memory management in games can be quite painful. You have to consider different hardware setups and being able to keep the textures you need for a certain scene in memory (or not, in which case, texture thrashing).
> But you try to lock() on a backgroundworker (the sane thing to do)
That’s just a leaky abstraction. Updating an image from a background thread is a sane thing to do from a general-purpose programmer’s POV. To understand why it’s not such a good idea, and why it’s not supported, you need to know what happens under the hood - specifically, how 3D GPU hardware works and executes these commands.
> No solution exists.
From a graphics programmer’s POV, the sane solution is to only call the GPU from a single thread. In D3D11 there are things which can be called from background threads: it’s possible to create a new texture on a background thread, uploading the new data to VRAM, pass the texture to the GUI thread, and then on the GUI thread either copy between textures (very fast) or replace the texture, destroying the old one (even faster). Unfortunately, doing so is slower in many practical use cases, like updating that bitmap at 60 Hz: creating new resources is relatively expensive, more so than updating a preexisting one.
> and tells the GPU what assets to start preloading, etc.
How does that shader gain information about which portions of the texture are needed at each mipmap level?
Or do you just load the whole texture and consume memory you don't actually need to render the image? It'd perform badly due to unnecessary I/O causing long loading times, and you'd also waste large portions of GPU RAM.
Or does your shader try to guess? Do you attempt to reverse engineer exactly how the GPU's trilinear texture sampler operates? Otherwise you won't know which parts of the asset data are needed - guess wrong and you get weird artefacts when the GPU samples your texture at a memory location you didn't load. Oops. Rounding differences compared to the hardware texture sampler would almost certainly get you. Not sure if it's still true, but at least in the past, different brand/model GPUs implemented texture sampling slightly differently [1], enough to force you to have a version for many GPU vendors and models.
Or do you disable trilinear sampling and use just one mipmap level that you somehow pick? You'd get bad image quality: blur and/or aliasing (like moire).
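For context on why guessing is hard: trilinear filtering picks its LOD from the texel footprint of each screen pixel, roughly as in the sketch below. This is the simplified textbook formula; real hardware adds anisotropy, rounding, and clamping that differ per GPU, which is exactly the compatibility problem described above:

```c
#include <math.h>

/* Roughly how a mip level is chosen: measure how many texels one screen
 * pixel covers (from the UV change between adjacent pixels) and take log2
 * of that footprint. Simplified; not bit-exact for any real GPU. */
static float estimate_lod(float du_dx, float dv_dx,  /* UV change per pixel, X */
                          float du_dy, float dv_dy,  /* UV change per pixel, Y */
                          float tex_w, float tex_h)
{
    float fx2 = du_dx * tex_w * du_dx * tex_w + dv_dx * tex_h * dv_dx * tex_h;
    float fy2 = du_dy * tex_w * du_dy * tex_w + dv_dy * tex_h * dv_dy * tex_h;
    float rho2 = fx2 > fy2 ? fx2 : fy2;  /* squared footprint in texels */
    return 0.5f * log2f(rho2);           /* 0.5 * log2(x^2) == log2(x)  */
}
```

For example, a 256x256 texture drawn across 16x16 screen pixels gives a UV step of 1/16 per pixel, a footprint of 16 texels, and an LOD of 4, i.e. the 16x16 mip - and trilinear sampling will actually blend between two adjacent levels, which is what makes predicting exactly which data the sampler touches so fiddly.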
Even after considering all that, how are you going to deal with texture atlases?
Your way sounds really complicated, unless you're willing to make rather serious compromises in robustness, quality, or loading-time performance.
> I guess I’m a tiny bit surprised that the browsers are de-allocating and re-allocating memory for the re-render of the text box elements, since they don’t change size. Presumably this would be texture memory? Does anyone here know precisely what’s happening there, and whether a render-to-texture won’t work for some reason? Is it normal for browsers to allocate memory for all element repaints, or is this something unique to text typing or font handling?
It highly depends on the browser and the specific painting backend in use. With a traditional CPU painting backend, browsers typically render at tile granularity, and tiles shuffle in and out of a global pool for repaints. That is, when the browser goes to repaint content, the painter grabs a tile from a shared pool, renders the new content into it, and then hands it off to the compositor, which swaps the tile buffer and places the old buffer back into the pool. This means that the GPU memory access patterns are somewhat irregular.
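A rough sketch of that tile-recycling pattern (illustrative only, not any browser's actual data structures): the painter acquires a buffer from a shared free list, and the compositor returns the buffer it swapped out:

```c
#include <stdlib.h>

/* Hypothetical tile pool: paint buffers cycle between painter and
 * compositor instead of being allocated and freed on every repaint.
 * (A real implementation would also need locking, since the two sides
 * run on different threads.) */
typedef struct Tile {
    void        *pixels;   /* tile-sized pixel buffer */
    struct Tile *next;     /* free-list link          */
} Tile;

enum { TILE_BYTES = 256 * 256 * 4 };
static Tile *free_list = NULL;

static Tile *acquire_tile(void)      /* painter: grab a buffer to paint into */
{
    if (free_list) {
        Tile *t = free_list;
        free_list = t->next;
        return t;
    }
    Tile *t = malloc(sizeof *t);     /* pool empty: allocate a fresh tile */
    t->pixels = malloc(TILE_BYTES);
    t->next = NULL;
    return t;
}

static void release_tile(Tile *t)    /* compositor: return the swapped-out buffer */
{
    t->next = free_list;
    free_list = t;
}
```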
If the GPU is being used for painting, as for example in Skia-GL/Ganesh, then there are all sorts of caches and allocators in use by the backend. Typically, these systems try to avoid allocation as much as possible by aggressively reusing buffers. Allocating every frame is generally bad. But it can be hard to avoid, especially with immediate-mode APIs like Skia. (I would expect that WebRender could be better here long-term, though we probably have work to do as well.)
> First, you'll have to learn to draw to the screen. I had no idea how this worked. You are actually clearing the screen then drawing each portion of the screen in rapid succession, many times a second, to create the effect that objects are moving.
> Can someone expand on exactly what is meant by "model LOD" in this context?
Back in the day, games just had one version of each model, which either got loaded or not.
Nowadays, games with lots of models and a huge amount of detail let each model have multiple versions, each with its own LOD (Level of Detail).
So if you see a tree from far away, it might be 20 vertices because you're far away from it so you wouldn't see the details anyways. But if you're right next to it, it might have 20,000 vertices instead.
It's an optimization technique to not send too much geometry to the GPU.
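A minimal sketch of the selection step, assuming each model carries a small array of meshes from most to least detailed plus distance thresholds (all names and numbers here are made up for illustration):

```c
/* Hypothetical per-model LOD table: meshes[0] is the full-detail version,
 * meshes[2] the cheapest. Thresholds are illustrative. */
typedef struct {
    const void *meshes[3];        /* e.g. 20,000 / 2,000 / 20 vertex versions */
    float       max_distance[3];  /* e.g. 50.0, 200.0, "infinity"             */
} ModelLods;

static const void *pick_lod(const ModelLods *m, float camera_distance)
{
    for (int i = 0; i < 2; i++)
        if (camera_distance < m->max_distance[i])
            return m->meshes[i];   /* close enough for this level of detail */
    return m->meshes[2];           /* far away: send the cheapest mesh      */
}
```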
> And of course your next keystroke is not zero copy. It needs to be sent to the GPU and the texture needs to be updated.
No textures need to be updated. The texture stays in VRAM across frames; you just need to change which texture the character cell points to. That is zero-copy.
If you were to do this with a software-rendered terminal, at minimum the software would tell the windowing system which region of the window changed and then copy that region to VRAM. That’s only if the window system supports region updates; if not, you’d need to copy the entire window to VRAM each frame. Much slower than just twiddling a pointer to a texture.
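A sketch of the bookkeeping being described: the terminal keeps a grid of cells where each cell just stores an index into the cached glyph atlas, so a keystroke rewrites one small integer rather than re-uploading pixels (field names are illustrative, not from any particular terminal):

```c
#include <stdint.h>

/* Hypothetical GPU-visible cell grid. The atlas textures never move;
 * a keystroke only changes which atlas entry a cell points to. */
typedef struct {
    uint32_t glyph_index;   /* which glyph in the cached atlas to sample */
    uint32_t fg_color;      /* packed RGBA foreground color              */
    uint32_t bg_color;      /* packed RGBA background color              */
} Cell;

/* cells points at a persistently mapped GPU buffer (or a CPU copy that is
 * re-uploaded as one tiny range); cols is the terminal width in cells. */
static void on_keystroke(Cell *cells, int cols, int row, int col,
                         uint32_t glyph_index)
{
    cells[row * cols + col].glyph_index = glyph_index;
}
```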