The screen videos are interesting, but too fast to follow, and make reading the accompanying text impossible for those of us with fragile concentration. Reader View (Safari) drops the screen imagery entirely, so goes too far the other way.
I agree, I had to right click and unselect "Loop" just so I could read the nearby text.
Also, the "Why Flat Data" section feels like it was written by someone trying to win a high school essay contest - it's full of unnecessary language. I would have an easier time understanding it without the awkward metaphors, exclamation points, and over-casual language like "cool".
Requiring Deno, and not allowing Node, seems like it could cause a lot of annoyance for people who aren't accustomed to Deno or want to use packages written for Node. I hope this becomes more flexible in the future.
This is really cool! I would have liked to incorporate this into my vaccine appointment slot finder tool a few months ago. I like using git commits for change tracking too. Seems similar (though not identical) to what they're doing at Dolt (https://www.dolthub.com/).
Yup, there's Dolt, and DVC, and probably a dozen other projects I'm forgetting or haven't heard of. Dat!
There's more than one way to data. We looked at a bunch of them, and the key thing we keep coming back to is git semantics. In many ways, all these other projects attempt to graft git semantics on top of more scalable datastores, allowing you to "fork" your data or roll it back to a given version. Trouble is, these abstractions have subtly different semantics or behaviors. These aren't inherently bad — just not the same as the ones you know from git.
This approach sacrifices "scalability" in order to let you Just Use Git™. It won't work (well) for a larger dataset, but we find that it's useful in a ton of situations.
For example: I have personally shipped bugs to production because my test fixtures had stale example data. I should have remembered to create new fixtures, but I didn't. Flat could have made them for me, on a schedule, subsampling and anonymizing production data as it worked.
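To make that concrete, the postprocessing step could be a small Deno script along these lines (a rough sketch; the field names and sampling logic are made up):

    // postprocess.ts — hypothetical Flat postprocessing step that turns a
    // production data pull into a small, anonymized test fixture.
    // (Sketch only: field names and sampling rate are invented.)
    const [inputPath] = Deno.args; // Flat passes the downloaded file's path
    const rows = JSON.parse(await Deno.readTextFile(inputPath));

    // Subsample: keep every 100th row for a fixture-sized slice.
    const sample = rows.filter((_: unknown, i: number) => i % 100 === 0);

    // Anonymize: overwrite fields that should never land in a committed fixture.
    const fixtures = sample.map((row: Record<string, unknown>, i: number) => ({
      ...row,
      email: `user${i}@example.com`,
      name: `Test User ${i}`,
    }));

    await Deno.writeTextFile("fixtures.json", JSON.stringify(fixtures, null, 2));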
It's a subtle difference in application. If your goal is to version $BIGDATA, then Flat isn't the right tool for the job, and you should check out Dolt, DVC & co.
It's storing the files in the repository, where individual files have a size limit of 100 MB. I think the repositories themselves have a soft limit of 5 GB and a hard limit of 100 GB.
It doesn't scale! This isn't a replacement for databases.
Our take on this is about "working sets" of data — if you have billions of rows, that's a lot bigger than a working set! At some point, you have to query, filter, and aggregate to get your data down to a chewable size for work.
You can do that in your code too, and sometimes that's absolutely the right approach! But often it's easier to push that work to "outside your code," and that is what Flat is great for.
Thank you for the response and for clearing up the 'billion rows' / 'surly bonds' confusion I had from reading the project's "Why Flat Data?" section. I think I understand the target use case slightly better now.
One of the strong arguments for object-like storage (S3 etc) in the context of plain / flat data is scalability and availability for large scale processing frameworks. Databases are only occasionally relevant.
You’re getting at the heart of Actions. Actions was never intended to be “CI” or any such vertical capability. It has always been intended as a platform that exposes capabilities like CI or packages out to the world, but the underlying serverless, very flexible workflow platform is the bedrock upon which we want to build the future.
My long-held view is that the only real ‘competitor’ to what I want GitHub to be is AWS and the other major cloud infra companies, and if you share that view, you can likely see why the past four years of GitHub, and the next few, make a lot of sense.
And it makes even more sense when you squint just a bit and realize what codespaces + repos + actions (CI/security/packages + other things) + automated workflows would eventually add up to. Now imagine a bit further out into the future, and what it would mean if we understood your production workloads a bit more.
Hi Jason, thank you very much for the background and the explanation. It is fascinating to see the progress in this direction.
I started raising my eyebrow (in the best possible sense) upon seeing parts of this tooling that are very similar to ours, but simpler and, more importantly, without moving parts. We operate in the biomedical data space and deal with flat/static data a lot; for example, we power https://biokeanos.com with data-in-repo, so Flat Data was immediately interesting.
It is really inspiring to see GitHub Actions making a foray in this direction, definitely something to keep an eye on.
If this is the vision, please let us write actions directly in TypeScript or some legitimate programming language (not YAML). It is currently impossible to debug and reuse action code.
I am working on an entire company migration off GitHub actions because it cannot scale. Full programmatic control and local debugging that allows me to reuse and test code in a single repository would have justified staying with GH.
Those only work at the actions level. I need them at the workflow level. The biggest issue here is the differentiation between actions and workflows, I need my workflows to be treated as actions and reuse entire segments of them. This isn't possible without copy/pasting code.
I also need arbitrary logic to configure and run my workflows (like an else branch... that would be nice).
This is in large part why TeamCity and Jenkins beat out Actions when we reevaluated.
The YAML file is not a config file in the workflows I have written; it is the top-level program calling many other programs. The syntax limitations (and the unsafety of its interpreter) make that unwieldy. But it's not possible to work around without never using any other action in the marketplace, which kind of defeats the purpose of using actions at all.
I don't know much about Flat Data, but I'm impressed with how much GitHub is doing as GitHub since the MSFT acquisition. They continue to offer compelling services to developers, and increasingly to enterprise customers, all without abandoning much of what made GitHub great: a focus on developers and easy-to-access dev productivity.
Notice the prominence of the VSCode integration here. Notice the dramatically increased presence of MSFT on GitHub in general. It seems like they've managed to integrate these two cultures and product-sets in sensible ways. Given how hard big integrations like this are to pull off, I feel like the community really dodged a bullet in terms of access to products/tools.
Agree, it's important that we keep an eye on things and, however we can, hold MSFT and GitHub accountable to keep up the good showing.
We've seen new features launched (e.g. this one) long enough after the acquisition that much (most? all?) of the work happened in the post-acquisition environment, so I'm optimistic. But I've been wrong before.
It's already here, is just that the userbase and third parties are (happily) doing the dirty work for them. Try going GitHub-free for a month or three and you'll notice how many things rest on the assumption that you have a GitHub account. "Log in with GitHub" is essentially what Microsoft hoped for with Passport, if Passport had actually been successful.
Look at how it shat on Markdown with what it calls "GitHub Flavored Markdown". Look at the things that it calls "wikis". Look at how GitHub's PR merge tool junks up the commit log. Look at how many projects don't even have a way to accept a fix unless you submit it with GitHub's janky pull request workflow. Hell, a bug in Netlify's command-line client made its way into release versions that would straight up cause the process to terminate with an unhandled exception when cwd was a repo that wasn't hosted on github.com.
The tacit assumption that you're using GitHub is like the tacit assumption 15 years ago that you were using Visual Studio, only this time, you can't escape just by steering clear of Windows-related tech.
They acquired a company that was doing the thing that they are wont to do and are criticized for, and have poured the significant resources at their disposal into growing the circle of impact. Where it originated, and whether or not it was already independently in full swing (or partial, in this case) before their involvement, doesn't matter; the effect on the user is the same. Besides that, if a person's problem with a given practice is whether or not Microsoft is the perpetrator, then that person is a hypocrite and doesn't actually give a shit about the thing they claim to be concerned about.
GitHub Flavored Markdown seems like a nice extension to Markdown to me. Fenced code blocks? Great idea. Lots of other flavors of Markdown do the same thing. I don't know who's the leader or follower here, but I'm glad they're doing it. I'm not sure what's the gold standard for wikis, but they all seem like the kind of thing every vendor has similar, flawed, good-enough solutions for. And I know there are other thoughts around how to manage merges, but having a merge commit (or a squash merge or a fast forward) seems like a reasonable contender for handling a feature branch. But maybe there's something I'm missing? I guess any hegemony is bad for innovation?
Are there a lot of walled gardens that only allow sign-in with GitHub? That's not really an issue I've run into. I can't think of any site I want to invite my aunt/uncle/cousin to log into that only accepts GitHub login. In fact, I'm not sure there's a lot of tools I want my colleagues to use that require a GitHub login that isn't already tied to a GitHub hosted repo.
The OCTO DevEx team reaaaaaallly loves VS Code — beyond the editor, it's just a great surface for experimental developer tooling!
GitHub Codespaces aren't generally available yet, but being able to target both "native" VS Code as well as in-browser VS Code with the same extension is super powerful. Expect a lot more from us on that front.
We've also released a pair of little projects around VS Code development that we've extracted from our work:
https://github.com/githubocto/tailwind-vscode: a Tailwind CSS plugin which creates Tailwind color tokens for each of the VS Code theme colors, easing theme-native styling in VS Code.
Honestly, I wish I was able to get into the Codespaces preview so I could play around with a lot of the experimental features it offers, especially the tie-in with Docker Desktop Dev Environments; my whole development workflow is likely going to change drastically this year.
The GitHub acquisition was likely the single catalyst that showed me Microsoft has actually pivoted how they're approaching business, and is at least putting in a good-faith effort to do better.
With literally zero actual knowledge to reference on how GitHub has felt since then, outwardly it feels like MSFT has played to all of their strengths (money, infrastructure, money, the "developers, developers, developers" memes) and amplified what GitHub had been pushing for.
I only hope it keeps going well, because it's certainly keeping me engaged in trying more MSFT products and services than I likely would've otherwise even glanced at.
Flat data tool chains include a lot of common command-line tools that you can schedule via GitHub Actions.
Have a look at e.g. csvkit (which lets you run SQL queries against CSV files, and even supports joining files), jq, and other command-line tools that work on simple CSV or JSON files. Lower-level tools like grep, awk, cut, etc. also work, of course.
I've done quite a bit of ad hoc data querying over the years where our data was just too tiny to bother with a lot of ceremony or infrastructure.
GitHub Actions also allows you to run stuff via Docker, so loading some CSVs into PostgreSQL (or whatever you need) would be a possibility as well.
The main limitation here is free build minutes and size of the data. Beyond a certain size you might want to use something more suitable. But I guess the point is that a lot of data just really isn't even close to big data.
The really interesting thing about this to me is that if this wasn't being put out via GitHub, I would have dismissed it as being potentially against the TOS or abuse of GitHub's free service. But with them putting it out, I'm quite interested in reevaluating my use cases for GitHub.
See the comment from @jasoncwarner about GitHub actions being a platform for much more than CI.
I wonder how far that extends to non-GitHub provided services. For instance, could we leverage GitHub actions, perhaps even Flat Data, to scrape some web site and store it (perhaps uploading elsewhere) in a more comprehensive way vs. storing some small snippet of the data in a git repo?
Yes. Or an S3 bucket, or whatever. The thing I'm getting at is: can we use GitHub Actions for application tasks like web scraping that need compute and network access, but that don't really do much with a git repo? Does GitHub want to support that?
Funny, I’m currently working on a project where I’m fetching post data from a WordPress backend with a few GQL queries, via the WPGraphQL plugin and `@urql/svelte`, to populate a static SSG’d frontend. While developing locally, I copied and pasted the JSON response into a local file in the repo to develop against. I was thinking this would be nice to automate.
If I’m understanding correctly, it seems like this tool more or less automates that process?
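For context, the manual step I'd like to automate is basically this (the endpoint and query shape are placeholders, not my real ones):

    // fetch-posts.ts — what I currently do by hand: pull posts from the
    // WPGraphQL endpoint and save the JSON response into the repo.
    // (The endpoint URL and query are placeholders.)
    const query = `{ posts { nodes { title slug content } } }`;

    const res = await fetch("https://example.com/graphql", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    });

    const { data } = await res.json();
    await Deno.writeTextFile("src/data/posts.json", JSON.stringify(data, null, 2));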
This is a really powerful use-case! If you saw Alex Gaynor's election tracker[1] during the US 2020 elections, it's exactly how it worked. Actions scraped the NYT election results.json, and a static site on GH pages rendered the data, XHRing the scraped JSON out of the repo periodically.
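The rendering side of that pattern is tiny: something like the sketch below, where the raw.githubusercontent.com path is illustrative rather than the tracker's actual URL.

    // Poll the scraped JSON straight out of the repo and re-render.
    // (Illustrative repo path, not the real tracker's.)
    const DATA_URL =
      "https://raw.githubusercontent.com/some-user/some-repo/main/results.json";

    function render(results: unknown): void {
      console.log("updated:", results); // stand-in for the page's real rendering
    }

    async function refresh(): Promise<void> {
      const res = await fetch(DATA_URL);
      render(await res.json());
    }

    refresh();
    setInterval(refresh, 60_000); // re-fetch every minute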
There's no GraphQL backend yet! We've only done HTTP and SQL backends so far. If your GQL query is simple enough, you might be able to squeak by with an HTTP flat action whose target is https://your.site/graphql?query=whatever ?
Interesting. It's not an official product from GitHub, but I love the idea, and I love that they're upfront about their inspiration from Simon, a really interesting person to follow; I admire his investment in Datasette, sqlite-utils, and Django.
As for Git scraping: although I think the idea is awesome, I thought it was against GitHub Actions' rules, or at the very least on the edge. So I don't know what GitHub's position on this is, as it's not an official thing from them, but it gives me positive vibes.
Same here! I was reasonably confident that Git scraping was within the boundaries of GitHub Actions supported use-cases but it did always feel a little bit on the edge, this is fantastic confirmation that it's a supported technique.
As someone who has written so much boilerplate data-collection code (i.e. scripts that I run on a cron against a local repo, then push to GitHub), this is really incredible. I've been really impressed with what Simon W. has shown off with GitHub Actions, but hadn't yet felt compelled enough to dive in and learn the conventions. This looks like a great entry point.
Don't know if this is the place to report bugs, but I was trying the github.com → flatgithub.com data viewer trick on an old repo named `white_house_salaries`.
My data subdirectories have several files named white_house_salaries.csv — e.g. data/wrangled/white_house_salaries.csv is the "finished" version. However, visiting that file on flatgithub.com gives me a "No valid data" error.
I get the same error when visiting data/fused/white_house_salaries.csv.
However, when I rename the file to something other than "white_house_salaries.csv", like data/wrangled/white_house_salaries_wrangled.csv, it works as expected.
I'm guessing there must be some issue with the data filename (white_house_salaries.csv) sharing the same name as the repo (storydrivendatasets/white_house_salaries)?
Hey there! Matt from the DevEx team here. Apologies for the lack of polish – I think the issue here is that the flatgithub.com URL only works when you specify the repo owner and repo name, à la https://flatgithub.com/storydrivendatasets/white_house_salar....
It gets confused by all of the other stuff afterward, "tree/master/data/wrangled".
> We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended. Smaller repositories are faster to clone and easier to work with and maintain. Individual files in a repository are strictly limited to a 100 MB maximum size limit. For more information, see "Working with large files."
> If your repository excessively impacts our infrastructure, you might receive an email from GitHub Support asking you to take corrective action.
Correct! And if you're Simon Willison, this is a super easy thing to Just™ implement manually.
The point of Flat Data is to push the edges of that bubble outwards. Add tooling and examples. Add a viewer. Make the "happy path" situations where this is helpful really fast and easy.
We're pretty upfront about this not being a major technological advance. The difference between a difficult-to-use API and a good API is usually just about the mental model. We like this mental model, and the kinds of patterns it encourages!
I am sensing some interesting capabilities here, but also get the impression that this is more about denormalized views of data (JSON/CSV/etc) than anything else. It's also in the name - 'Flat'.
Perhaps it is actually supported and I can't read properly, but I feel like you are just 1 tiny step away from allowing someone to write one of these things such that it can ETL any arbitrary data source into a SQLite database (i.e. many tables). There's not a whole lot of difference between CSV and SQLite when it comes to repository file management. Granted, SQLite databases would present as opaque blobs at code review time, but this is something we can tolerate because you still get all of the nice versioning & project consistency. Hell, you could probably write a special GitHub-branded diff viewer that allows you to compare 2 different SQLite databases, schema & all.
SQLite in general is such a force to be reckoned with. You could do a lot of damage (in a good way) with product features built up around the most popular database engine on earth.
I have an experiment from a while ago around this idea: sqlite-diffable is a tool for dumping out a SQLite database to disk in a format that is designed to live in a git repository and produce readable diffs: https://github.com/simonw/sqlite-diffable
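The core of the idea fits in a few lines of Deno. This sketch (not sqlite-diffable itself, and assuming the deno.land/x/sqlite module) dumps each table to newline-delimited JSON so git diffs show row-level changes:

    // dump.ts — write every table in a SQLite DB out as newline-delimited
    // JSON, one file per table, so each changed row shows up as a diff line.
    // (A sketch of the idea, not the actual sqlite-diffable tool.)
    import { DB } from "https://deno.land/x/sqlite/mod.ts";

    const db = new DB("data.db");

    const tables = db.query<[string]>(
      "SELECT name FROM sqlite_master WHERE type = 'table'",
    );

    for (const [table] of tables) {
      const rows = db.query(`SELECT * FROM "${table}"`);
      const lines = rows.map((row) => JSON.stringify(row)).join("\n");
      Deno.writeTextFileSync(`${table}.ndjson`, lines + "\n");
    }

    db.close();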
I once ran a web scraper on an hourly schedule with GitHub Actions that wrote to a JSON file in my gh-pages branch and saved its results with "git commit --amend". Glad to see this workflow in a more integrated environment than my janky hack.
> Let's say we only want a snippet of the data in our original endpoint. We can create a postprocessing.js or postprocessing.ts file written in Deno to filter and save just the data we care about.
Nice to see Deno being used and preferred over Node for this use case: the post-processing script can be just a single file that imports its dependencies from URLs and caches them, instead of needing a whole folder with a package.json.
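For the curious, such a postprocessing script can be as small as this sketch, which assumes (per the Flat README) that the script receives the downloaded file's path as its first argument, and uses made-up field names:

    // postprocessing.ts — trim a fetched JSON payload down to the fields
    // we actually care about before it gets committed.
    // (Field names are hypothetical; the input path comes from Flat.)
    const [inputPath] = Deno.args;
    const raw = JSON.parse(await Deno.readTextFile(inputPath));

    const slim = raw.items.map((item: Record<string, unknown>) => ({
      id: item.id,
      price: item.price,
      updatedAt: item.updatedAt,
    }));

    await Deno.writeTextFile("postprocessed.json", JSON.stringify(slim, null, 2));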