Challenge, with context in my other comment at the same depth:
Suppose you have a table which includes power usage readings for all devices, sent every five seconds. Something like this:
(id, timestamp, power)
Write me an ES query which gets the average power usage for device id=x over the last year, grouped by the seven days of the week and, within each day, by fifteen-minute interval.
To clarify:
The groups will be the product of the days of the week with the ninety-six fifteen-minute intervals in a day, making 7*96 unique groups.
You will not be able to do this in any simple way.
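For contrast, the same 7*96 grouping is close to a one-liner in SQL. A rough sketch in SQLite, treating the (id, timestamp, power) table above as a hypothetical readings table with ISO-8601 text timestamps, and :device standing in for the id=x filter:
SELECT
  CAST(strftime('%w', timestamp) AS INTEGER) AS day_of_week,              -- 0 = Sunday .. 6 = Saturday
  CAST(strftime('%H', timestamp) AS INTEGER) * 4
    + CAST(strftime('%M', timestamp) AS INTEGER) / 15 AS quarter_hour,    -- 0 .. 95
  avg(power) AS avg_power
FROM readings
WHERE id = :device
  AND timestamp >= datetime('now', '-1 year')
GROUP BY day_of_week, quarter_hour
ORDER BY day_of_week, quarter_hour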
We capture and store energy readings at ~5second intervals, then display total energy at various time granularity by aggregating the values over minutes, hours, days, months, etc.
Absolutely! A starting point would be something like this to calculate total busy time per day:
SELECT date(start) AS day, sum(julianday(end) - julianday(start)) AS busy_days FROM events GROUP BY day
Some extra aggregation and you could totally calculate your busy/free ratio
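For the busy/free ratio, one sketch (same assumptions: an events table with text start/end timestamps, and events that cross midnight counted against their start day) is to compare each day's busy time against the one day it sits in:
SELECT
  date(start) AS day,
  sum(julianday(end) - julianday(start)) AS busy_fraction,          -- fraction of the day spent in events
  1.0 - sum(julianday(end) - julianday(start)) AS free_fraction
FROM events
GROUP BY day
ORDER BY day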
This is one of the queries I'm using to track a rolling average of how many times a week I've been cycling:
WITH RECURSIVE dates(day) AS
(
  SELECT date($__unixepochfrom(), 'unixepoch')
  UNION ALL
  SELECT date(day, '+1 DAY')
  FROM dates
  WHERE day < date($__unixepochto(), 'unixepoch')
)
SELECT day AS time,
  (
    SELECT count(DISTINCT julianday(start))
    FROM events
    WHERE summary = " Cycling"
      AND start <= date(day, '+1 DAY')
      AND start > date(day, '-7 DAY')
  ) AS cycling
FROM dates
ORDER BY day DESC
And yeah, I 100% admit the documentation is lacking. I'll be honest, I got to the three-week mark of building and decided I needed to validate the idea before putting any more work in.
Going to be adding some more pre-done queries like this as inspiration.
If OP takes this approach, please please please make this usage data easily accessible, ideally both as a (free?) API endpoint and as a timechart/table in the web UI if applicable.
Less sophisticated, but recently I realized that just plain remembering certain numbers like "average hours per month" (~730) or "seconds per day" (86400) is really helpful to make very quick estimates. A customer wants a quota of 100k queries a day? That averages out to a little more than one query per second assuming equal distribution.
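If you want to double-check those constants, the arithmetic fits in one throwaway query (any SQL engine will do):
SELECT
  24 * 365.25 / 12 AS avg_hours_per_month,       -- ~730.5
  24 * 60 * 60     AS seconds_per_day,           -- 86400
  100000.0 / 86400 AS avg_queries_per_second     -- ~1.16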
A more functional mindset can definitely help here. Think of your "sales per day" model as starting with a sequence of days (startingDay up to startingDay + n) as the input to a function that maps each day to an aggregate of that day's activity.
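In SQL, that "sequence of days as input" shape usually ends up as a generated date series joined onto the aggregate; a sketch in SQLite, assuming a hypothetical sales(sold_at, amount) table:
WITH RECURSIVE days(day) AS (
  SELECT date('now', '-30 day')
  UNION ALL
  SELECT date(day, '+1 day') FROM days WHERE day < date('now')
)
SELECT
  days.day,
  coalesce(sum(sales.amount), 0) AS daily_total    -- days with no sales still appear, as 0
FROM days
LEFT JOIN sales ON date(sales.sold_at) = days.day
GROUP BY days.day
ORDER BY days.day
The LEFT JOIN onto the day series is what keeps empty days in the output, which is the part ad-hoc GROUP BY queries usually miss.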
Aggregate functions in SQL are IMHO quite awesome once you develop enough comfort with them to stop worrying about them. I wouldn't want to try to get Excel to tell me, or to hand-write the code for, something like "show me the standard deviation in units sold by day of week over the last ten summers."
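That particular example is only a few lines once the data is in a table; a sketch (Postgres-flavored, since stock SQLite has no stddev aggregate), assuming a hypothetical sales(sold_at, units) table and June-August as "summer":
WITH daily AS (
  SELECT date_trunc('day', sold_at) AS sale_day, sum(units) AS units_sold
  FROM sales
  WHERE extract(month FROM sold_at) IN (6, 7, 8)              -- June-August
    AND sold_at >= now() - interval '10 years'
  GROUP BY sale_day
)
SELECT
  extract(isodow FROM sale_day) AS day_of_week,               -- 1 = Monday .. 7 = Sunday
  stddev_samp(units_sold)       AS stddev_units_sold
FROM daily
GROUP BY day_of_week
ORDER BY day_of_week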
> At the end of the script, the overall elapsed time and the geometric mean for all the queries is computed directly by querying the history view of all TPC-DS statements that have executed on the warehouse.
The geometric mean? Really? Feels a lot easier to think in terms of arithmetic mean, and perhaps percentiles.
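For what it's worth, the geometric mean is cheap to compute next to the arithmetic one, since it's just the exponential of the mean log; a sketch, assuming a hypothetical query_history view with an elapsed_seconds column:
SELECT
  avg(elapsed_seconds)          AS arithmetic_mean_s,
  exp(avg(ln(elapsed_seconds))) AS geometric_mean_s     -- exp of the mean of the logs
FROM query_history
WHERE elapsed_seconds > 0                                -- ln() is undefined at zero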
People expect e.g. their "daily power consumption" to end at midnight local time. When they zoom in on the daily aggregates, they don't want a spike to suddenly disappear just because it belonged to the next local day and you only stored UTC-day aggregates. Even more so for billing.
That's a great point! I vaguely tried to go in that direction by plotting the average over the past 7 days, the average over the 7 days before that, and the average over the year to date. Still, I fully agree one might want to push this a little further and do as you describe. Thanks for the suggestion!
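Concretely, "pushing it further" here means bucketing on the user's local day rather than the UTC day; a Postgres-flavored sketch, assuming a hypothetical readings(ts timestamptz, power) table and Europe/Berlin as an example zone:
SELECT
  (ts AT TIME ZONE 'Europe/Berlin')::date AS local_day,    -- midnight-to-midnight in the user's zone
  sum(power)                              AS daily_power
FROM readings
GROUP BY local_day
ORDER BY local_day
This only works if the stored data is finer-grained than a day; once everything is rolled up to UTC-day aggregates, there's no way to recut it per time zone.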
I don't know about an official answer but using a simple variation on a round-robin database (RRD) ought to do the trick. Basically:
day count = [last 23 hour buckets] + hour count
hour count = [last 59 minute buckets] + minute count
minute count = [last 59 second buckets] + second count
second count = <running count cleared every 1 second>
You end up losing a small amount of accuracy, the worst case being the second counts. You can avoid that by using an actual list of counter values for the seconds bucket and dumping those out when they get older than a second, or maybe by introducing another subdivision (1/10th of a second is probably fine) so that you actually have .9 seconds' worth of data and can interpolate the last bit.
In addition to this, I've started sampling days. Instead of recording transactions for every day, I can do something like every fifth day. This reduces precision, but it's still within a range acceptable for my purposes.
(Though it's important that it's either a systematic sample (every n days) or a random one (roll a die each day; if it's 5 or higher, record that day). The days you pick have to be uncorrelated with the kinds of expenses that fall on them, or your results will be biased.)
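As a sketch of the systematic variant in SQL (SQLite, assuming a hypothetical transactions(tx_date, amount) table):
SELECT tx_date, amount
FROM transactions
WHERE CAST(julianday(tx_date) AS INTEGER) % 5 = 0    -- keep every fifth calendar day
When estimating totals from the sample, scale the sampled sums back up by the sampling factor (5x here).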
Thanks! It has proved to be very useful. Can you please post a link, or reply to this comment, detailing the source of the data? I care less about staleness than about accuracy, and I'd also like the freshness to be more precise if possible, e.g. data labelled as 3 hours old really is 3 hours old and not 5-6 hours old.