01:30:40 GMT I'd like to version full web pages, including HTML, CSS, images, JS, etc. My current plan is to save each resource to a file named after its SHA-2 hash and then use redis as a lookup of url -> hash. Since I want these versioned, I'd need something like url -> [hash ...], and I'd need to quickly look up the latest hash for any url.
01:31:19 GMT Does someone with more familiarity with redis have suggestions for applying redis to this approach, or perhaps for an alternative tool?
01:33:05 GMT sounds like a use case for git or similar
01:33:30 GMT that's certainly how the web pages and their resources were originally managed across revisions when they were published
01:34:23 GMT I was originally using a normal fs + git, but I have some instances where foo.com/bar is both a file and a directory, which is a real pain.
01:36:27 GMT yeah, that's going to be a real problem in any case
01:37:34 GMT A k/v lookup, such as what I could get with redis, seemed to rule that problem out, at the cost of complexity + indirection. Not yet a final decision, but it seemed like one I should explore more deeply.
01:41:58 GMT so basically archive the site at a certain point into a single file, SHA-2 it, then rename the file with that hash, and insert it into a sorted set inside redis with key=website_name and values=sorted set of hashes?
01:44:09 GMT I think this data also sounds relational, so you may consider an RDBMS (e.g. Postgres) that stores a table of metadata about the site for each time it was scraped (timestamps, the hash, size, etc.) and then another table full of matching records that store the on-disk filename for that site+scrape_time
01:44:13 GMT I was thinking about it on a per-file basis. So, visiting foo.com downloads the index HTML as the initial request, with the url foo.com. That gets hashed, saved, and added to redis as key=foo.com value=[hash]. Then, or meanwhile, another request is made to foo.com/img/bg.png and the process repeats.
01:44:58 GMT oh, that doesn't sound very efficient wrt pipelining and keep-alive HTTP/1.1 requests
01:45:53 GMT The HTTP layer isn't much of a concern, since I'm using an existing efficient proxy to do that work. I just have some middle-man logic for processing responses with specific MIME types.
01:46:26 GMT honestly, I would wait to create any records until after the scraping process is 100% complete
01:47:17 GMT ACID style
01:47:26 GMT I see what you mean.
01:48:16 GMT but you can break it up into a job queue with workers doing the downloading, if that suits your issues best
01:49:21 GMT I'm not yet sure how that can be applied here, though. What I'm working on is not a scraper but a proxy which persists all history. I could wait until a page load is entirely complete, which will likely quicken each page load, but it's just deferring the same work.
01:49:46 GMT I may've misunderstood you.
01:51:45 GMT Perhaps you were suggesting that, with knowledge of all requests made for the foo.com page, the records created could be optimized?
01:52:34 GMT oh, you're making a caching proxy
01:53:02 GMT one that basically doesn't evict based on the normal rules
01:53:21 GMT that isn't what I was thinking, sorry
01:55:18 GMT Redis' data structures cannot be nested inside other data structures, so storing a List inside a Hash is not possible.
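
A minimal sketch of the per-URL versioning idea discussed above, assuming redis-py and a local Redis instance: each captured body is written to disk under its SHA-256, and a sorted set keyed by the URL maps hashes to capture timestamps, so the latest hash is one ZREVRANGE away. The key naming ("ver:<url>"), the blob directory, and the function names are illustrative assumptions, not something agreed in the log.

    import hashlib
    import time
    from pathlib import Path
    from typing import Optional

    import redis

    r = redis.Redis()
    BLOBS = Path("blobs")  # content-addressed files, named by their SHA-256 (assumed layout)

    def save_version(url: str, body: bytes) -> str:
        """Persist one captured response and record it as the newest version of `url`."""
        digest = hashlib.sha256(body).hexdigest()
        BLOBS.mkdir(exist_ok=True)
        (BLOBS / digest).write_bytes(body)            # blob keyed by its own hash
        r.zadd(f"ver:{url}", {digest: time.time()})   # sorted set: score = capture time
        return digest

    def latest_version(url: str) -> Optional[str]:
        """Highest-scored (most recent) hash for a URL, or None if never captured."""
        hashes = r.zrevrange(f"ver:{url}", 0, 0)
        return hashes[0].decode() if hashes else None

    def history(url: str):
        """All (hash, capture_time) pairs for a URL, newest first."""
        return [(h.decode(), ts)
                for h, ts in r.zrevrange(f"ver:{url}", 0, -1, withscores=True)]

Using the capture time as the score keeps the set ordered by recency, which is exactly the "latest hash for any url" lookup asked about at 01:30:40, while still preserving the full per-file history.
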
01:57:57 GMT depending on your read/access patterns, you can do compound keys like url_timestamp -> hash
01:58:19 GMT then you can retrieve all the past records for a URL by searching keys matching url_*
02:00:08 GMT alternatively, you can store it in an RDBMS as I suggested, since it's more relational (table of sites -> table of URLs -> table of saved records with timestamps and hashes), and then put the file on disk named by the hash stored in the last table when you retrieve it
02:00:52 GMT making it easy to grab the history for a path, find the latest, pull the latest for every file on a site, etc. in a single query
02:01:55 GMT Hm, giving this more thought. I appreciate your suggestions.
02:02:04 GMT np
02:04:31 GMT You've pointed out a key flaw in my original logic, which was oriented around per-file records. If I want to step back in time for the whole page, then a timestamp-based lookup, and likely an RDBMS, seems like a saner approach.
02:05:00 GMT When I only care about individual files, then redis seems like a sane option.
02:13:58 GMT sounds right to me
02:19:20 GMT Yep, this has been enlightening. Thanks again, zpojqwfejwfhiunz.
17:43:19 GMT hi guys :) what's the recommended approach when i have like 100k keys that are kinda 'tied together'? 100k is the total pool of keys that exist at once, but i'm only getting about 1k every time i access that 'pool'. a hash with hmget/hmset, or simply separate keys and mget?
17:45:02 GMT tried both approaches and both kinda end up in the slow log at some point
18:00:53 GMT Hi, I have a question about redis keyspace notifications for evictions.
18:02:01 GMT I have a redis instance that had a massive number of evictions (~130k) over a 40-minute period. This caused the application that listens to the events to get saturated and stop responding to other requests.
18:02:37 GMT But even an hour after the evictions stopped, I noticed the same error pattern on my service.
18:03:13 GMT So does redis stagger notifications for such a massive number of events occurring together, or does it send them all at once?
18:04:58 GMT Basically, what I want to know is the behavior of redis when it has to deal with a massive number of keyspace events occurring together. Are they all sent out at the same time, or are they staggered/spaced out?
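
A minimal sketch of the relational layout suggested earlier in the log (02:00:08): one table of capture metadata per site/path/timestamp, with the body itself on disk under its hash. It uses Python's built-in sqlite3 so it runs standalone, but the schema and query carry over to Postgres; the database filename, table, columns, and function names are illustrative assumptions.

    import hashlib
    import sqlite3
    import time

    db = sqlite3.connect("captures.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS captures (
            id          INTEGER PRIMARY KEY,
            site        TEXT NOT NULL,   -- e.g. 'foo.com'
            path        TEXT NOT NULL,   -- e.g. '/img/bg.png'
            hash        TEXT NOT NULL,   -- SHA-256 of the body; also the on-disk filename
            size        INTEGER NOT NULL,
            captured_at REAL NOT NULL    -- capture timestamp
        )
    """)

    def record_capture(site: str, path: str, body: bytes) -> None:
        """Insert one metadata row; the body itself lives on disk under its hash."""
        db.execute(
            "INSERT INTO captures (site, path, hash, size, captured_at) VALUES (?, ?, ?, ?, ?)",
            (site, path, hashlib.sha256(body).hexdigest(), len(body), time.time()),
        )
        db.commit()

    def latest_snapshot(site: str):
        """Latest stored hash for every path on a site, answered by a single query."""
        return db.execute("""
            SELECT path, hash, captured_at FROM (
                SELECT path, hash, captured_at,
                       ROW_NUMBER() OVER (PARTITION BY path ORDER BY captured_at DESC) AS rn
                FROM captures
                WHERE site = ?
            ) AS ranked
            WHERE rn = 1
        """, (site,)).fetchall()

Stepping back in time for a whole page, as discussed at 02:04:31, is then the same query with an extra "captured_at <= :ts" condition in the inner WHERE clause.
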