The screen videos are interesting, but too fast to follow, and make reading the accompanying text impossible for those of us with fragile concentration. Reader View (Safari) drops the screen imagery entirely, so goes too far the other way.
I agree, I had to right click and unselect "Loop" just so I could read the nearby text.
Also, the "Why Flat Data" section feels like it was written by someone trying to win a high school essay contest - it's full of unnecessary language. I would have an easier time understanding it without the awkward metaphors, exclamation points, and over-casual language like "cool".
Requiring Deno, and not allowing Node, seems like it could cause a lot of annoyance for people who aren't accustomed to Deno or want to use packages written for Node. I hope this becomes more flexible in the future.
This is really cool! I would have liked to incorporate this into my vaccine appointment slot finder tool a few months ago. I like using git commits for change tracking too. Seems similar (though not identical) to what they're doing at Dolt (https://www.dolthub.com/).
Yup, there's Dolt, and DVC, and probably a dozen other projects I'm forgetting or haven't heard of. Dat!
There's more than one way to data. We looked at a bunch of them, and the key thing we keep coming back to is git semantics. In many ways, all these other projects attempt to graft git semantics on top of more scalable datastores, allowing you to "fork" your data or roll it back to a given version. Trouble is, these abstractions have subtly different semantics or behaviors. These aren't inherently bad — just not the same as the ones you know from git.
This approach sacrifices "scalability" in order to let you Just Use Git™. It won't work (well) for a larger dataset, but we find that it's useful in a ton of situations.
For example: I have personally shipped bugs to production because my test fixtures had stale example data. I should have remembered to create new fixtures, but I didn't. Flat could have made them for me, on a schedule, subsampling and anonymizing production data as it worked.
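To make that concrete, the postprocessing step could be a small Deno script along these lines (a rough sketch; the field names and sampling logic are made up):

    // postprocess.ts — hypothetical Flat postprocessing step that turns a
    // production data pull into a small, anonymized test fixture.
    // (Sketch only: field names and sampling rate are invented.)
    const [inputPath] = Deno.args; // Flat passes the downloaded file's path
    const rows = JSON.parse(await Deno.readTextFile(inputPath));

    // Subsample: keep every 100th row for a fixture-sized slice.
    const sample = rows.filter((_: unknown, i: number) => i % 100 === 0);

    // Anonymize: overwrite fields that should never land in a committed fixture.
    const fixtures = sample.map((row: Record<string, unknown>, i: number) => ({
      ...row,
      email: `user${i}@example.com`,
      name: `Test User ${i}`,
    }));

    await Deno.writeTextFile("fixtures.json", JSON.stringify(fixtures, null, 2));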
It's a subtle difference in application. If your goal is to version $BIGDATA, then Flat isn't the right tool for the job, and you should check out Dolt, DVC & co.
It's storing the files in the repository, where individual files have a size limit of 100 MB. I think the repositories themselves have a soft limit of 5 GB and a hard limit of 100 GB.
It doesn't scale! This isn't a replacement for databases.
Our take on this is about "working sets" of data — if you have billions of rows, that's a lot bigger than a working set! At some point, you have to query, filter, and aggregate to get your data down to a chewable size for work.
You can do that in your code too, and sometimes that's absolutely the right approach! But often it's easier to push that work to "outside your code," and that is what Flat is great for.
Thank you for the response and for clearing up the 'billion rows' / 'surly bonds' confusion I had from reading the project's "Why Flat Data?" section. I think I understand the target use case slightly better now.
One of the strong arguments for object-like storage (S3 etc) in the context of plain / flat data is scalability and availability for large scale processing frameworks. Databases are only occasionally relevant.
You’re getting at the heart of Actions. Actions was never intended to be “CI” or any such vertical capability. It has always been intended as a platform that exposes capabilities like CI or packages out to the world, but the underlying serverless, very flexible workflow platform is the bedrock upon which we want to build the future.
My long-held view is that the only real ‘competitor’ to what I want GitHub to be is AWS and the other major cloud infra companies, and if you share that view, you can likely see why the past four years of GitHub, and the next few, make a lot of sense.
And it makes even more sense when you squint just a bit and realize what codespaces + repos + actions (CI/security/packages + other things) + automated workflows would eventually add up to. Now imagine a bit further out into the future, and what it would mean if we understood your production workloads a bit more.
Hi Jason, thank you very much for the background and the explanation. It is fascinating to see the progress in this direction.
I started raising my eyebrow (in the best possible sense) upon seeing parts of this tooling that are very similar to ours, but simpler and, more importantly, without moving parts. We operate in the biomedical data space and deal with flat/static data a lot; for example, we power https://biokeanos.com with data-in-repo, so Flat Data was immediately interesting.
It is really inspiring to see GitHub Actions making a foray in this direction, definitely something to keep an eye on.
If this is the vision, please let us write actions directly in TypeScript or some legitimate programming language (not YAML). It is currently impossible to debug and reuse action code.
I am working on an entire company migration off GitHub actions because it cannot scale. Full programmatic control and local debugging that allows me to reuse and test code in a single repository would have justified staying with GH.
Those only work at the actions level. I need them at the workflow level. The biggest issue here is the differentiation between actions and workflows, I need my workflows to be treated as actions and reuse entire segments of them. This isn't possible without copy/pasting code.
I also need arbitrary logic to configure and run my workflows (like an else branch... that would be nice).
This is in large part why TeamCity and Jenkins beat out Actions when we reevaluated.
The YAML file is not a config file in the workflows I have written; it is the top-level program calling many other programs. The syntax limitations (and the unsafety of its interpreter) make that unwieldy. But it's not possible to work around without never using any other action in the marketplace, which kind of defeats the purpose of using actions at all.
I don't know much about Flat Data, but I'm impressed with how much GitHub is doing as GitHub since the MSFT acquisition. They continue to offer compelling services to developers, and increasingly to enterprise customers, all without abandoning much of what made GitHub great: a focus on developers and easy-to-access dev productivity.
Notice the prominence of the VSCode integration here. Notice the dramatically increased presence of MSFT on GitHub in general. It seems like they've managed to integrate these two cultures and product-sets in sensible ways. Given how hard big integrations like this are to pull off, I feel like the community really dodged a bullet in terms of access to products/tools.
Agree, it's important that we keep an eye on things and, however we can, hold MSFT and GitHub accountable to keep up the good showing.
We've seen new features launched (e.g. this one) long enough after the acquisition that much (most? all?) of the work happened in the post-acquisition environment, so I'm optimistic. But I've been wrong before.
It's already here, is just that the userbase and third parties are (happily) doing the dirty work for them. Try going GitHub-free for a month or three and you'll notice how many things rest on the assumption that you have a GitHub account. "Log in with GitHub" is essentially what Microsoft hoped for with Passport, if Passport had actually been successful.
Look at how it shat on Markdown with what it calls "GitHub Flavored Markdown". Look at the things that it calls "wikis". Look at how GitHub's PR merge tool junks up the commit log. Look at how many projects don't even have a way to accept a fix unless you submit it with GitHub's janky pull request workflow. Hell, a bug in Netlify's command-line client made its way into release versions that would straight up cause the process to terminate with an unhandled exception when cwd was a repo that wasn't hosted on github.com.
The tacit assumption that you're using GitHub is like the tacit assumption 15 years ago that you were using Visual Studio, only this time, you can't escape just by steering clear of Windows-related tech.
They acquired a company that was doing the thing that they are wont to do and are criticized for, and have poured the significant resources at their disposal into growing the circle of impact. Where it originated, and whether or not it was already independently in full swing (or partial, in this case) before their involvement, doesn't matter; the effect on the user is the same. Besides that, if a person's problem with a given practice is whether or not Microsoft is the perpetrator, then that person is a hypocrite and doesn't actually give a shit about the thing they claim to be concerned about.
GitHub Flavored Markdown seems like a nice extension to Markdown to me. Fenced code blocks? Great idea. Lots of other flavors of Markdown do the same thing. I don't know who's the leader or follower here, but I'm glad they're doing it. I'm not sure what's the gold standard for wikis, but they all seem like the kind of thing every vendor has similar, flawed, good-enough solutions for. And I know there are other thoughts around how to manage merges, but having a merge commit (or a squash merge or a fast forward) seems like a reasonable contender for handling a feature branch. But maybe there's something I'm missing? I guess any hegemony is bad for innovation?
Are there a lot of walled gardens that only allow sign-in with GitHub? That's not really an issue I've run into. I can't think of any site I want to invite my aunt/uncle/cousin to log into that only accepts GitHub login. In fact, I'm not sure there's a lot of tools I want my colleagues to use that require a GitHub login that isn't already tied to a GitHub hosted repo.
The OCTO DevEx team reaaaaaallly loves VS Code — beyond the editor, it's just a great surface for experimental developer tooling!
GitHub Codespaces aren't generally available yet, but being able to target both "native" VS Code as well as in-browser VS Code with the same extension is super powerful. Expect a lot more from us on that front.
We've also released a pair of little projects around VS Code development that we've extracted from our work:
https://github.com/githubocto/tailwind-vscode: a Tailwind CSS plugin which creates Tailwind color tokens for each of the VS Code theme colors, easing theme-native styling in VS Code.
Honestly, I wish I was able to get into the Codespaces preview so I could play around with a lot of the experimental features it offers, especially the tie-in with Docker Desktop Dev Environments; my whole development workflow is likely going to change drastically this year.
The GitHub acquisition was likely the single catalyst that showed me Microsoft has actually pivoted how they're approaching business, and is at least putting in a good-faith effort to do better.
With literally zero actual knowledge to reference on how GitHub has felt since then, outwardly it feels like MSFT has played to all of their strengths (money, infrastructure, money, the "developers, developers, developers" memes) and amplified what GitHub had been pushing for.
I only hope it keeps going well, because it's certainly keeping me engaged in trying more MSFT products and services than I likely would've otherwise even glanced at.
Flat data tool chains include a lot of common command-line tools that you can schedule via GitHub Actions.
Have a look at e.g. csvkit (which lets you run SQL queries against CSV files, and even supports joining files), jq, and other command-line tools that work on simple CSV or JSON files. Lower-level tools like grep, awk, cut, etc. also work, of course.
I've done quite a bit of ad hoc data querying over the years where our data was just too tiny to bother with a lot of ceremony or infrastructure.
GitHub Actions also allows you to run stuff via Docker, so loading some CSVs into PostgreSQL (or whatever you need) would be a possibility as well.
The main limitation here is free build minutes and size of the data. Beyond a certain size you might want to use something more suitable. But I guess the point is that a lot of data just really isn't even close to big data.
The really interesting thing about this to me is that if this wasn't being put out via GitHub, I would have dismissed it as being potentially against the TOS or abuse of GitHub's free service. But with them putting it out, I'm quite interested in reevaluating my use cases for GitHub.
See the comment from @jasoncwarner about GitHub actions being a platform for much more than CI.
I wonder how far that extends to non-GitHub provided services. For instance, could we leverage GitHub actions, perhaps even Flat Data, to scrape some web site and store it (perhaps uploading elsewhere) in a more comprehensive way vs. storing some small snippet of the data in a git repo?
Yes. Or an S3 bucket, or whatever. The thing I'm getting at is: can we use GitHub Actions for application tasks like web scraping that need compute and network access, but that don't really do much with a git repo? Does GitHub want to support that?
Funny, I’m currently working on a project where I’m fetching post data from a WordPress backend with a few GQL queries, via the WPGraphQL plugin and `@urql/svelte`, to populate a static SSG’d frontend. While developing locally, I copied and pasted the JSON response into a local file in the repo to develop against. I was thinking this would be nice to automate.
If I’m understanding correctly, it seems like this tool more or less automates that process?
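For context, the manual step I'd like to automate is basically this (the endpoint and query shape are placeholders, not my real ones):

    // fetch-posts.ts — what I currently do by hand: pull posts from the
    // WPGraphQL endpoint and save the JSON response into the repo.
    // (The endpoint URL and query are placeholders.)
    const query = `{ posts { nodes { title slug content } } }`;

    const res = await fetch("https://example.com/graphql", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    });

    const { data } = await res.json();
    await Deno.writeTextFile("src/data/posts.json", JSON.stringify(data, null, 2));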
This is a really powerful use-case! If you saw Alex Gaynor's election tracker[1] during the US 2020 elections, it's exactly how it worked. Actions scraped the NYT election results.json, and a static site on GH pages rendered the data, XHRing the scraped JSON out of the repo periodically.
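The rendering side of that pattern is tiny: something like the sketch below, where the raw.githubusercontent.com path is illustrative rather than the tracker's actual URL.

    // Poll the scraped JSON straight out of the repo and re-render.
    // (Illustrative repo path, not the real tracker's.)
    const DATA_URL =
      "https://raw.githubusercontent.com/some-user/some-repo/main/results.json";

    function render(results: unknown): void {
      console.log("updated:", results); // stand-in for the page's real rendering
    }

    async function refresh(): Promise<void> {
      const res = await fetch(DATA_URL);
      render(await res.json());
    }

    refresh();
    setInterval(refresh, 60_000); // re-fetch every minute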
There's no GraphQL backend yet! We've only done HTTP and SQL backends so far. If your GQL query is simple enough, you might be able to squeak by with an HTTP flat action whose target is https://your.site/graphql?query=whatever ?
Interesting. It's not an official product from GitHub, but I love the idea, and I love that they're upfront about their inspiration from Simon, a really interesting person to follow; I admire his investment in Datasette, sqlite-utils, and Django.
As for Git scraping: although I think the idea is awesome, I thought it was against GitHub Actions' rules, or at the very least on the edge. So I don't know what GitHub's position on this is, as it's not an official thing from them, but it gives me positive vibes.
Same here! I was reasonably confident that Git scraping was within the boundaries of GitHub Actions supported use-cases but it did always feel a little bit on the edge, this is fantastic confirmation that it's a supported technique.
As someone who has written so much boilerplate data-collection code (i.e. scripts that I run on a cron against a local repo, then push to GitHub), this is really incredible. I've been really impressed with what Simon W. has shown off with GitHub Actions, but hadn't yet felt compelled enough to dive in and learn the conventions. This looks like a great entry point.
Don't know if this is the place to report bugs, but I was trying the github.com → flatgithub.com data viewer trick on an old repo named `white_house_salaries`.
My data subdirectories have several files named white_house_salaries.csv — e.g. data/wrangled/white_house_salaries.csv is the "finished" version. However, visiting that file on flatgithub.com gives me a "No valid data" error.
I get the same error when visiting data/fused/white_house_salaries.csv.
However, when I rename the file to something other than "white_house_salaries.csv", like data/wrangled/white_house_salaries_wrangled.csv, it works as expected.
I'm guessing there must be some issue with the data filename (white_house_salaries.csv) sharing the same name as the repo (storydrivendatasets/white_house_salaries)?
Hey there! Matt from the DevEx team here. Apologies for the lack of polish – I think the issue here is that the flatgithub.com URL only works when you specify the repo owner and repo name, à la https://flatgithub.com/storydrivendatasets/white_house_salar....
It gets confused by all of the other stuff afterward, "tree/master/data/wrangled".
> We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended. Smaller repositories are faster to clone and easier to work with and maintain. Individual files in a repository are strictly limited to a 100 MB maximum size limit. For more information, see "Working with large files."
> If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action.
Correct! And if you're Simon Willison, this is a super easy thing to Just™ implement manually.
The point of Flat Data is to push the edges of that bubble outwards. Add tooling and examples. Add a viewer. Make the "happy path" situations where this is helpful really fast and easy.
We're pretty upfront about this not being a major technological advance. The difference between a difficult-to-use API and a good API is usually just about the mental model. We like this mental model, and the kinds of patterns it encourages!
I am sensing some interesting capabilities here, but also get the impression that this is more about denormalized views of data (JSON/CSV/etc) than anything else. It's also in the name - 'Flat'.
Perhaps it is actually supported and I can't read properly, but I feel like you are just 1 tiny step away from allowing someone to write one of these things such that it can ETL any arbitrary data source into a SQLite database (i.e. many tables). There's not a whole lot of difference between CSV and SQLite when it comes to repository file management. Granted, SQLite databases would present as opaque blobs at code review time, but this is something we can tolerate because you still get all of the nice versioning & project consistency. Hell, you could probably write a special GitHub-branded diff viewer that allows you to compare 2 different SQLite databases, schema & all.
SQLite in general is such a force to be reckoned with. You could do a lot of damage (in a good way) with product features built up around the most popular database engine on earth.
I have an experiment from a while ago around this idea: sqlite-diffable is a tool for dumping out a SQLite database to disk in a format that is designed to live in a git repository and produce readable diffs: https://github.com/simonw/sqlite-diffable
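The core of the idea fits in a few lines of Deno. This sketch (not sqlite-diffable itself, and assuming the deno.land/x/sqlite module) dumps each table to newline-delimited JSON so git diffs show row-level changes:

    // dump.ts — write every table in a SQLite DB out as newline-delimited
    // JSON, one file per table, so each changed row shows up as a diff line.
    // (A sketch of the idea, not the actual sqlite-diffable tool.)
    import { DB } from "https://deno.land/x/sqlite/mod.ts";

    const db = new DB("data.db");

    const tables = db.query<[string]>(
      "SELECT name FROM sqlite_master WHERE type = 'table'",
    );

    for (const [table] of tables) {
      const rows = db.query(`SELECT * FROM "${table}"`);
      const lines = rows.map((row) => JSON.stringify(row)).join("\n");
      Deno.writeTextFileSync(`${table}.ndjson`, lines + "\n");
    }

    db.close();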
I once ran a web scraper on an hourly schedule with GitHub Actions that wrote to a JSON file in my gh-pages branch and saved its results with "git commit --amend". Glad to see this workflow in a more integrated environment than my janky hack.
> Let's say we only want a snippet of the data in our original endpoint. We can create a postprocessing.js or postprocessing.ts file written in Deno to filter and save just the data we care about.
Nice to see Deno being used and preferred over Node for this use case: the post-processing script can be just a single file that imports its dependencies from URLs and caches them, instead of needing a whole folder with a package.json.
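For the curious, such a postprocessing script can be as small as this sketch, which assumes (per the Flat README) that the script receives the downloaded file's path as its first argument, and uses made-up field names:

    // postprocessing.ts — trim a fetched JSON payload down to the fields
    // we actually care about before it gets committed.
    // (Field names are hypothetical; the input path comes from Flat.)
    const [inputPath] = Deno.args;
    const raw = JSON.parse(await Deno.readTextFile(inputPath));

    const slim = raw.items.map((item: Record<string, unknown>) => ({
      id: item.id,
      price: item.price,
      updatedAt: item.updatedAt,
    }));

    await Deno.writeTextFile("postprocessed.json", JSON.stringify(slim, null, 2));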