Not to offend, but yeah, if you've already partitioned the work to be done, then of course it's not hard.

I'm pretty sure that anyone who's come into even vague contact with parallel programming understands that the hardest part is communication and synchronization.

Your post feels incredibly pedantic.



I think you underestimate the number of real-world problems that can be solved by a bash script executing "./some_task &" in a for loop, and using "jobs", "fg", and "bg" as needed.

For many problems, that's literally good enough parallelism.

--------

Case in point: I once needed to render an animation in Blender, and I noticed that Blender's built-in parallelism was kind of crap (it was an older version of Blender at the time). My 32-thread / 16-core system barely reached 400% utilization... I knew I could do better.

So I just spawned 16 Blender instances in headless mode, each handling a different frame of the animation, and achieved far higher utilization than Blender's innate parallelism.

Yes, literally:

        #!/bin/bash
        for i in $(seq 1 16)
        do
                # Yeah, just run a bunch of blenders in headless + background mode;
                # the params use $i so each instance renders a different frame range
                blender (params) &
        done
        wait    # optional: block until all 16 background renders finish
Bonus points: I tweaked the script to take advantage of NUMA locality by automatically pinning each Blender instance to one NUMA node. All without ever touching a single line of C code.

-------

Communication / synchronization is hard. But parallelism (even high-performance NUMA-locality with thread affinity pinned to particular CPU nodes) doesn't have to be hard. Just think a bit smarter.

--------

The equivalent of "./some_job &" in POSIX C / C++ is fork() and wait(). That's good enough for a ton of problems (hell, it's how we older programmers achieved parallelism back in the CGI-BIN days).
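
For illustration, a minimal sketch of that fork()/wait() pattern; do_task() here is just a made-up stand-in for whatever "./some_job" would actually do:

    /* Fork one child per independent task, then wait for them all: the
       shell's "&" and "wait" expressed in POSIX C. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static void do_task(int id)
    {
        printf("task %d running in pid %d\n", id, (int)getpid());
        /* ... the real, independent work goes here ... */
    }

    int main(void)
    {
        const int ntasks = 16;

        for (int i = 0; i < ntasks; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                exit(EXIT_FAILURE);
            }
            if (pid == 0) {          /* child: do one unit of work and exit */
                do_task(i);
                _exit(EXIT_SUCCESS);
            }
            /* parent: keep forking the rest */
        }

        while (wait(NULL) > 0)       /* reap every child: the "join" */
            ;
        return 0;
    }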

Don't reach for complicated high-performance techniques if you only need a bit of parallelism. Don't overthink things: stick to simple parallelism when simple solutions solve simple problems.

Don't prematurely optimize into complicated producer/consumer patterns unless you actually need the performance. Definitely don't prematurely optimize into high-performance lockless programming techniques unless you really need to.


I mean in the example you gave, the work was already partitioned for you. That we have high-level tools that make it easy doesn't change the fact that the frames of an animation are very clear chunks of work that lend themselves naturally to parallelization.

We can quibble about the semantics of "parallel" programming all day, but I think that either too broad or too narrow a scope for the term misses the point: that data coherency is what makes parallel programming hard.

In the ultimate reductio ad absurdum, you can just say it falls into one of the two classes of hard computer science problems, namely "cache invalidation", where the "cache" is some form of shared data. In this sense, I agree with you: it's not the parallelism per se that's hard, but I think anyone talking about its difficulty is quite clearly referring to the shared-data problems.


> I mean in the example you gave, the work was already partitioned for you.

In my experience, I keep running into problems that are *already partitioned*.

Make -j. 3D-animation frames. SAXPY arrays. (Seriously, do you know how many for loops I've come across that can become #pragma omp parallel for and just automatically benefit from parallelism??) Multiple connections to a server (ex: fork/join Apache model / CGI-bin / PHP / etc. etc.). Multiple connections to other servers (SMTP, IRC, etc. etc.)
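
For the curious: a minimal sketch of the kind of loop I mean, with made-up SAXPY-style data (compile with something like gcc -fopenmp):

    /* A plain for loop over independent elements (SAXPY: y = a*x + y),
       parallelized with a single OpenMP pragma. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static float x[N], y[N];

    int main(void)
    {
        const float a = 2.0f;

        for (int i = 0; i < N; i++) {   /* made-up input data */
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        /* Every iteration is independent, so this parallelizes trivially */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f, max threads = %d\n", y[0], omp_get_max_threads());
        return 0;
    }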

A ton of problems are "easy parallelism" in my experience. Probably most problems. Yeah, there are some hard problems out there, but beginners should learn how to parallelize the easy stuff first.


The point is that these easy cases may just be referred to as "work to do". Using more than one CPU core at a time to process a backlog of work is just... the normal case. Doing work simultaneously is not what people refer to when they say "parallel programming is hard"; and the pedantry is in insisting that people word themselves to exclude trivially parallel work when what they want to talk about are the cases where parallelism is not trivial.

The conversation above is like this:

"Solving cubic equations is hard"

"Actually, solving x ^ 3 = 27 is easy, just take the cubic root.."

"Well that is not what we meant when we said cubic equations, we were interested in the general case..."

"I see equations like x^3 = y all the time, it is not that often you have a linear term too..."


I see it differently.

* "Taking care of your car is hard".

* "No its not. 90% of the time, its just changing your oil, refilling the fuel tank, washer fluid, and rotating tires".

* "But you have to also replace the transmission fluid and sometimes rebuild the engine"

* "True, but those are not a very common tasks".

-----------

That's how I see it. Most parallelism is "easy parallelism" in my experience. The goal isn't necessarily to train programmers into being able to do "engine rebuilds", but instead train them for the easy "oil change / rotate tires" part of taking care of cars (or programs).

For programming: that's fork() / wait(), or pthread_create() / pthread_join(), or #pragma omp parallel for. Those constructs cover the vast, vast majority of the parallel code I've run across.
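
A minimal sketch of the pthread_create() / pthread_join() variant, with a made-up worker() and thread count (compile with -pthread):

    /* Spawn a handful of workers on independent chunks of work, then join
       them all: plain fork/join, no shared mutable state. */
    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        printf("worker %d doing its own chunk\n", id);
        /* ... the real, independent work goes here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }

        for (int i = 0; i < NTHREADS; i++)   /* the "join" half of fork/join */
            pthread_join(threads[i], NULL);

        return 0;
    }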

Yes, mutexes are hard. Condition variables are hard. Lock-free programming is hard. High-performance tuning (false sharing / MESI) is hard. GPU ballots / __shared__ memory / butterfly permutes are very hard. Prefix-sum / scan and Mergepath algorithms are hard. Lock-free producer-consumer is really, really, really hard (probably PhD-level hard).

But... I really don't come across a lot of situations where I need to Mergepath, GPU-ballot, or manually program / debug a lock-free algorithm. That's just the facts. I'm glad I learned these constructs (and I await the day when a high-performance programming problem is given to me such that I need to "pull out the big guns" I've spent a lot of effort learning)... but I recognize that it's not really needed in typical day-to-day parallel programming jobs.

Even in GPU programming: the bulk of GPU programming is just simple fork/join stuff. A GPU kernel is a fork, and when the kernel finishes execution, it's an innate join() (with synchronization guaranteed). It's really not that hard in most situations.

Schedule your chain of GPU kernels to execute in order (a "CUDA stream"), and they automatically join and synchronize with each other each time a kernel completes. It's that easy in most cases.

"Oh, but for maximum performance you want to use dynamic / recursive parallelism, or block-off your data into __shared__ memory and think about L1 caches". Erm, yes, if you care to put that much work into performance its hard. But you're no longer talking about parallelism per se, but instead the very difficult field of high-performance compute at that point.


Your analogy is very close to mine.

Following your analogy then -- this is an HN thread where car mechanics discuss a book about transmission fluids and engine rebuilding.

Did anyone make the point that most programmers often need to care about this stuff?


>* "True, but those are not a very common tasks".

Pretty much every survival/simulation game with complex mechanics suffers from low performance, because at some point the simulation becomes very CPU-intensive, yet parallelizing the code is still extremely difficult because of extensive interdependencies.

If you go with the trivial "divide the problem into the smallest parallelizable components" strategy, you will end up with huge interconnected graphs; sometimes there isn't even a second graph that can be put on a second CPU core. If you are insane enough to divide the problem at the agent level, then you end up with a synchronization nightmare. Sure, given enough effort you can do all these things, but that only entrenches the perception of parallelism as a last resort instead of the default solution for everything.


I think a lot of people don't know how easy it can and should be to handle the easy cases. C++ has some parallel constructs, but they are not as simple as OpenMP because they're meant to handle the harder stuff.

I think it's important to show and use the easy ways before getting into the complex things.

