Tooting my own horn here, but I'm working on a book covering the Go Standard Library. Still in progress, and not at the sync package yet, but it's coming along. Check it out if you feel inclined.
You know what'd be cool? To take arbitrary code in a language, pattern match on the implementation in that language of each std lib function, and actively recommend substitutes for duplicated code.
That would be very cool. I wonder how hard it would be, though. At least in Go I know that the standard library contains lots of duplicated code, primarily so that packages that should be small don't pull in larger ones as dependencies. I think the time package's String() functions reimplement fmt functionality, for example, since fmt is a much larger dependency than time should have.
Yes. What I mean though is that if you are reimplementing, say, fmt.Printf, such a suggestion system might correctly suggest you use fmt.Printf instead, but also suggest you can use func (m Month) String() string from time, or something equally silly.
Since the standard libs in Go duplicate code, you would have to be careful that your suggestion system isn't picking up false positives. I think the idea has a lot of promise though.
Even so, why would you write that boilerplate code out each time?
Something like this would work even better:
result = src.asyncMap { |e| dowork(e) };
Except Google Go returns several values instead of tuples, so you can't just collect all the results as-is. And with no generics it would be annoying to actually use e and the result, since they would need casts. Too bad.
If you think you can do asyncMap in Google Go, without casts and without manually collecting multiple return values, by all means show us the code. I would find that really interesting.
It's a one-off script, I didn't really care about errors. If this was something run regularly inside of a bigger application yes I'd have full error handling.
But tiny throwaway scripts get built up into huge applications all the time, sometimes by other people who don't have the mental TODO to go back and handle errors.
This is one of the things I love about Go. If he wants the return value but doesn't want to address the errors, he has to actively discard them. It makes you think twice when you put a _ in place of an error return. And to another programmer coming in to maintain the code, those _s stick out like a sore thumb.
I know zero Go (it's a little far down on the to-learn list), and maybe he edited it, but I went back and read the GP's code and didn't see anything that looked like ignoring errors.
It says to call ReadFile. In Go a function can return multiple values. ReadFile returns a byte array and an error value. Normally you'd check whether err is nil; if it is, no error happened. If it isn't, you can inspect it for information about the error and handle it.
A unique feature of Go is that declaring a variable and never using it is an error, not a warning. Actually, there aren't even compiler warnings: code is either right or wrong. This means if he had called:
file_in, err := ioutil.ReadFile("domains.txt")
but never checked err, it would not build. So to get the byte array but not the error, you use the _ symbol to tell the compiler to throw away that return value. This is what I meant about having to actively ignore error handling if you want a return value.
It seems tantalising for the compiler to also protest when return values of type error are not assigned to anything. An obvious inconvenience being that use of fmt.Println and similar would suddenly become noisy.
> A unique feature of Go is that declaring a variable and never using it is an error, not a warning. Actually, there aren't even compiler warnings: code is either right or wrong.
Ahhh. Nice. That sounds like a feature I could get behind, too. Thank you.
Go is great for stuff like this, especially when part of an actual system.
That said, if this was just an adhoc job (to figure out which domains point to a specific IP address) you can just use "xargs -P" or GNU parallel and it becomes a pretty basic shell script, along the lines of:
cat domains.txt | xargs -P 1000 -n 1 host
So what's the difference between xargs and parallel? I thought the point of parallel was that it was xargs with the addition of running things in parallel. But if xargs can do that already, is there any reason to use one over the other?
For what it's worth, GNU Parallel seems to be fairly new; at work I have an Ubuntu distro from this year and it's not in the package repo yet. xargs is POSIX, so you can expect it everywhere, though no parallel option is specified (merely encouraged).
In addition to the above, it's worth noting that parallel also supports running the jobs on multiple remote systems via ssh, giving you an easy way to take advantage of a whole cluster.
To be fair to languages without such great parallelism support: you can do this using asynchronous/event-loop-based code because the parallelism will be limited by the nameserver anyway (the calling code does almost nothing, it mostly waits for the net / the nameserver).
> To be fair to languages without such great parallelism support: you can do this using asynchronous/event-loop-based code
Well you can do with callbacks anything that you can do with channels and goroutines. Go's primary appeal is that it makes concurrent[1] code easy to reason about, not that it enables you to do anything that you "couldn't do" otherwise.
Continuations are just GOTOs, and just like GOTOs, some people love them and some people hate them, but even people who like them can find them difficult in large doses. Goroutines and channels are nice, because they fit the structure of imperative code, whereas callbacks sort of resemble imperative code but "inside out".
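A hedged sketch of that imperative shape (resolve here is a made-up stand-in for a real network call such as net.LookupHost, so the example stays self-contained and deterministic):

```go
package main

import (
	"fmt"
	"strings"
)

// resolve is a hypothetical stand-in for a DNS lookup.
func resolve(domain string) string {
	return "ip-for-" + strings.TrimSuffix(domain, ".")
}

func main() {
	domains := []string{"example.com", "example.org"}
	results := make(chan string)

	// Each lookup runs in its own goroutine, but the code still
	// reads top-to-bottom like ordinary imperative code -- no
	// inside-out callback nesting.
	for _, d := range domains {
		go func(d string) {
			results <- d + " => " + resolve(d)
		}(d)
	}

	for range domains {
		fmt.Println(<-results)
	}
}
```

The same logic with callbacks would split the "start the lookup" and "use the result" halves into separate functions, which is the inversion being described.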
from concurrent.futures import ThreadPoolExecutor as Pool
from socket import getaddrinfo

def lookup(domain):
    try:
        result = getaddrinfo(domain, 80)
    except Exception as e:
        print("error %s -> %s" % (domain, e))
    else:
        print("done %s -> %s" % (domain, result))

nconcurrent = 20
with open('domains.txt') as file, Pool(nconcurrent) as pool:
    for domain in (line.strip() for line in file):
        pool.submit(lookup, domain)
To run multiple processes instead of threads, change the import to ProcessPoolExecutor.
To support multiprocessing.Pool (for Python 2 where concurrent.futures is not in stdlib), replace pool.submit() with pool.apply_async() and use contextlib.closing() around the Pool().
in the DNS case asynchronous event handling would be super easy to do. in python asyncore with something like dpkt to construct and read DNS lookups works like a champ, as does twisted. i did a simple async DNS resolver in pure python (asyncore, dpkt) and can sustain thousands of lookups a second. GNU adns also has bindings in various languages.
you can get Go's style of parallelism via CSP (e.g. python-csp, ruby-csp) and replace a lot of fragile threading/parallel code with it. i've been doing that in lieu of learning Go (i know i know .. i'm lazy) and been very pleased.
anyhow, many ways to skin cats. those are just two or three.
You wouldn't do DNS lookups asynchronously in Go to begin with. Modeling concurrency of any sort in Go the way you would with an event loop is usually a code smell.
agreed, and i should have been more clear. the author's blog post states that one of the reasons he explored Go was that his initial sketch of a solution in his language of choice (ruby) was sequential. my point was that you can get performance in ruby with asynchronous operations, and that you don't need to go parallel for something like this.
then again if it was a matter of "well, i had a problem to solve and i had a desire to explore another language, so solving it in that new language was a way to explore" then the point is moot.
however agreed 100% or more on the "code smell" of doing an event loop in Go.
If I had to simultaneously generalize the idea and make it specific enough to explain it further, I'd say fiddly callback state machines are a code smell in Go.
Scala is really great for this too. The downside is the spin-up time for the JVM, but the upside is that if you use SBT's script launcher it will compile and cache the script transparently for you, you can still pull in _any_ JVM dependency, and you can run it just like a shell script. I needed to test for a port being open in parallel and it was a cakewalk to use NIO's SocketSelector to do it reactively.
I also include a list of 100 domains in a domains.txt if anyone wants to try for themselves.
require "socket"
require "celluloid"

class IPGetter
  include Celluloid

  def get(url)
    Socket.getaddrinfo(url, "http")[0][2]
  end
end

pool = IPGetter.pool(size: 100)
ips = {}

File.open("domains.txt").each_line do |line|
  line.chomp!
  ips[line] = pool.future.get(line)
end

ips.each do |url, ip_future|
  puts "#{url} => #{ip_future.value}"
end