
Challenge, with context in my other comment at the same depth:

Suppose you have a table which includes power usage readings for all devices, sent every five seconds. Something like this:

  (id, timestamp, power)
Write me an ES query which gets the average power usage for device id=x over the last year, broken down by day of the week, with each day further divided into fifteen-minute intervals.

To clarify:

The groups will be the product of the days of the week with the ninety-six fifteen-minute intervals in a day, making 7*96 unique groups.

You will not be able to do this in any simple way.
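
To make the target concrete, this is what that grouping looks like in plain SQL (a sketch, assuming a SQLite table named readings with the columns above; this is not the requested ES query, just the shape of the result):

  -- 7 weekdays x 96 quarter-hour slots = 672 groups
  SELECT strftime('%w', timestamp) AS weekday,
         strftime('%H', timestamp) * 4 + strftime('%M', timestamp) / 15 AS quarter_hour,
         avg(power) AS avg_power
  FROM readings
  WHERE id = :x
    AND timestamp >= date('now', '-1 year')
  GROUP BY weekday, quarter_hour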




We capture and store energy readings at ~5-second intervals, then display total energy at various time granularities by aggregating the values over minutes, hours, days, months, etc.
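
As a rough illustration of that kind of rollup (a sketch, assuming a SQLite table readings(device_id, ts, energy_wh); the names are mine, not from the comment above):

  -- hourly totals per device; change the strftime format for daily or monthly rollups
  SELECT device_id,
         strftime('%Y-%m-%d %H:00', ts) AS hour,
         sum(energy_wh) AS total_wh
  FROM readings
  GROUP BY device_id, hour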

> And IIRC, load average is represented usually as 3 numbers, which are over the last 5, 10, and 15 minutes

1, 5, and 15 minutes are the most common intervals.


Absolutely! A starting point would be something like this to calculate total busy time per day:

  SELECT date(start) AS day, sum(julianday("end") - julianday(start)) AS busy_days
  FROM events
  GROUP BY date(start)

Some extra aggregation and you could totally calculate your busy/free ratio.
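
For the ratio itself, something along these lines could work (a sketch building on the query above; it assumes events are stored with ISO timestamps and don't span midnight):

  -- fraction of each day spent busy vs. free (julianday differences are in days)
  SELECT date(start) AS day,
         sum(julianday("end") - julianday(start)) AS busy_fraction,
         1 - sum(julianday("end") - julianday(start)) AS free_fraction
  FROM events
  GROUP BY date(start)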

This is one of the queries I'm using to track a rolling average of how many times a week I've been cycling:

  WITH RECURSIVE dates(day) AS (
    SELECT date($__unixepochfrom(), 'unixepoch')
    UNION ALL
    SELECT date(day, '+1 DAY') FROM dates
    WHERE day < date($__unixepochto(), 'unixepoch')
  )
  SELECT day AS time,
         (SELECT count(DISTINCT julianday(start))
          FROM events
          WHERE summary = ' Cycling'
            AND start <= date(day, '+1 DAY')
            AND start > date(day, '-7 DAY')) AS cycling
  FROM dates
  ORDER BY day DESC

And yeah, I 100% admit the documentation is lacking. I'll be honest, I got to the three-week mark of building and decided I needed to validate the idea before putting any more work in. I'm going to be adding some more pre-done queries like this as inspiration.


If OP takes this approach, please please please make this usage data easily accessible, ideally both as a (free?) API endpoint and as a timechart/table in the web UI if applicable.

Less sophisticated, but recently I realized that just plain remembering certain numbers like "average hours per month" (~730) or "seconds per day" (86400) is really helpful to make very quick estimates. A customer wants a quota of 100k queries a day? That averages out to a little more than one query per second assuming equal distribution.
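
Spelled out (just the arithmetic, nothing product-specific):

  SELECT 100000.0 / 86400 AS avg_qps,          -- ~1.16 queries per second
         365.0 * 24 / 12  AS hours_per_month;  -- = 730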

But what if I want the average over time? The query depends on every value; should everything be computed on the server?

A more functional mindset can definitely help here. Think of your "sales per day" model as starting with a sequence of days, startingDay up to startingDay + n, and feeding each one to a function that maps it to an aggregate of that day's activity.
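
In SQL terms that usually means generating the day sequence first and joining the activity onto it, so days with no sales still show up (a sketch in SQLite; the sales table, its columns, and the 30-day window are assumptions):

  WITH RECURSIVE days(day) AS (
    SELECT date('now', '-30 day')            -- startingDay
    UNION ALL
    SELECT date(day, '+1 day') FROM days
    WHERE day < date('now')                  -- startingDay + n
  )
  SELECT days.day, coalesce(sum(sales.amount), 0) AS daily_total
  FROM days
  LEFT JOIN sales ON date(sales.sold_at) = days.day
  GROUP BY days.day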

Aggregate functions in SQL are IMHO quite awesome once you develop enough comfort with them to stop worrying about them per se. I wouldn't like to try to get Excel to tell me, or to write the code to do it manually, something like "show me the standard deviation in units sold by day of week over the last ten summers."
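
For instance, something like this (a sketch in a dialect with stddev_samp, e.g. Postgres; the sales table, its columns, and the June-August definition of "summer" are all assumptions):

  -- daily units sold during the summers of the last ten years, spread by day of week
  WITH daily AS (
    SELECT date_trunc('day', sold_at) AS day, sum(units) AS units
    FROM sales
    WHERE extract(month FROM sold_at) IN (6, 7, 8)
      AND sold_at >= now() - interval '10 years'
    GROUP BY 1
  )
  SELECT extract(dow FROM day) AS day_of_week,
         stddev_samp(units)    AS stddev_units
  FROM daily
  GROUP BY 1
  ORDER BY 1;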


> At the end of the script, the overall elapsed time and the geometric mean for all the queries is computed directly by querying the history view of all TPC-DS statements that have executed on the warehouse.

The geometric mean? Really? Feels a lot easier to think in terms of arithmetic mean, and perhaps percentiles.
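
For what it's worth, once the elapsed times are queryable, the geometric mean is just a log transform away, so both can be reported side by side (a sketch in Postgres-flavored SQL; the table and column names are assumptions):

  -- geometric mean = exp(mean(ln(x))); less sensitive to a few very slow queries
  SELECT exp(avg(ln(elapsed_seconds))) AS geo_mean_s,
         avg(elapsed_seconds)          AS arith_mean_s
  FROM query_history;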


People expect e.g. their "daily power consumption" to end at midnight local time. When they zoom in on the daily aggregates, they don't want a spike to suddenly disappear just because it belonged to the next local day and you only stored UTC-day aggregates. Even more so for billing.
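
Concretely, that means truncating to the user's local day before aggregating, rather than the UTC day (a sketch in Postgres-flavored SQL; the table, columns, and example time zone are assumptions, and ts is assumed to be stored as a UTC timestamptz):

  -- daily consumption bucketed by local midnight, not UTC midnight
  SELECT date_trunc('day', ts AT TIME ZONE 'Europe/Berlin') AS local_day,
         sum(energy_wh) AS consumption_wh
  FROM readings
  GROUP BY 1
  ORDER BY 1;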

That's a great point! I vaguely tried to go in that direction by plotting the average over the past 7 days, the average over the 7 days before that, and the average over the year to date. Yet, I fully agree one might want to push this a little further and do as you describe. Thanks for the suggestion!

Either you meant "per day" or you've got a really well-tuned database.

The "Energy Impact" and "Avg. Energy Impact" columns (also in Activity Monitor) are also helpful for this.

As long as transactions are in the right categories, you can group multiple similar transactions and report the dates as 'various'.
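
Something like this could produce that kind of report (a sketch in SQLite; the transactions table and its columns are assumptions):

  -- one line per category; show a single date only when every transaction shares it
  SELECT category,
         CASE WHEN count(DISTINCT date(ts)) = 1 THEN date(min(ts)) ELSE 'various' END AS dates,
         sum(amount) AS total
  FROM transactions
  GROUP BY category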

How does that solve comparing days with 23 to 25 hours of data when aggregated by hour? Or are you suggesting using days in UTC?

I don't know about an official answer but using a simple variation on a round-robin database (RRD) ought to do the trick. Basically:

  day count = [last 23 hour buckets] + hour count
  hour count = [last 59 minute buckets] + minute count
  minute count = [last 59 second buckets] + second count
  second count = <running count cleared every 1 second>
You end up losing a small amount of accuracy, the worst being for keeping track of the second counts. You can avoid that by using an actual list of counter values for the seconds bucket and dumping those out when they get older than a second, or maybe by introducing another subdivision (1/10th of a second is probably good) so that you actually have 0.9 seconds' worth of data and can interpolate the last bit.

Thank you. Using days as your time unit for measuring requests over time is not useful. I'd be more interested in the req/sec numbers during peak.

In addition to this, I've started sampling days. Instead of recording transactions for every day, I can do something like every fifth day. This reduces precision, but it's still within a range acceptable for my purposes.

(Though it's important that it's either a systematic sample (every n days) or a random one (throw a die each day; if it's 5 or higher, record that day). The days you pick have to be uncorrelated with the types of expenses on those days, or your results will be biased.)
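
A sketch of the systematic variant in SQLite (the transactions table and its columns are assumptions; scale the sampled totals back up by 5 to estimate the full period):

  -- keep only transactions from every fifth calendar day
  SELECT date(ts) AS day, sum(amount) AS spent
  FROM transactions
  WHERE CAST(julianday(date(ts)) AS INTEGER) % 5 = 0
  GROUP BY day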


Thanks! It has proved to be very useful. Can you please add a link or reply to this comment detailing the source of the data? I care less about staleness than about the accuracy of the data. I would also like more nuance around freshness if possible, e.g. data labeled as 3 hours old really being 3 hours old and not 5-6 hours old.

How often do you go over the raw data, in a manner that aggregates couldn't help you with, for data from 6+ months ago?