16:39:00 GMT still a question: should there be a 1-1 relationship between redis process connections and redis clients?
16:40:22 GMT we have a redis server which is reporting 7680 connected clients but lsof reports the redis server has 23132 established sockets
16:41:39 GMT we're investigating this as we have a golang daemon using the redigo client lib which believes it has active subscriptions, yet the redis server it thinks it's connected to doesn't see them
16:45:34 GMT it sounds like your client isn't properly closing the connections
16:45:36 GMT Recent versions of Redis (3.2 or greater) have TCP keepalive (SO_KEEPALIVE socket option) enabled by default and set to about 300 seconds.
16:45:56 GMT so you possibly have lingering connections waiting to time out
16:46:14 GMT I would expect those to show as half-closed connections
16:46:24 GMT not fully established ones
16:47:14 GMT as far as the client is concerned it has the open connection and is waiting for published events
16:47:31 GMT however redis no longer sees these client connections
16:52:44 GMT what states are you seeing?
16:52:57 GMT are some of them in CLOSE_WAIT?
16:53:14 GMT nope
16:53:18 GMT all in ESTABLISHED
16:55:15 GMT it looks like a client issue then, redigo uses a connection pool so even when you direct it to close a connection, it just keeps it open and returns it to the pool
16:55:53 GMT in the Pool struct, you can set IdleTimeout which defaults to zero (never close a connection)
16:56:06 GMT https://godoc.org/github.com/garyburd/redigo/redis#Pool
16:58:39 GMT we have an idle timeout of 3m
16:59:25 GMT on the client side? and then the default 5 minutes on Redis?
16:59:34 GMT <*> smh checks server
17:00:07 GMT I hope this is enough information for you to start troubleshooting how your client connection pool is handling the connections
17:00:25 GMT but it seems like the issue is indeed with redigo and not in Redis itself
17:00:41 GMT 1) "timeout"
17:00:41 GMT 2) "0"
17:01:47 GMT I'm actually suspecting more of a TCP stack level issue or a horrible golang-level TCP issue
17:02:13 GMT as the sessions are there even though the redis server doesn't see them as "clients"
17:09:23 GMT are you positive you're calling Close() and it's not deferred? are you sure there's no concurrent writes? are you sure there's only one Do() call for redigo? what's your MaxIdle set to for the Pool? are you sure you're only creating one Pool?
17:10:07 GMT concurrency always complicates things, so it will probably be something that's obvious in retrospect
17:10:18 GMT if we weren't calling close that would just increase the number of connected clients; we're seeing a lack of clients in comparison to established TCP sessions
17:10:50 GMT so you have 2 problems, which one do you want to talk about first?
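For context, a minimal sketch of the kind of redigo Pool configuration being discussed above (the old garyburd/redigo import path from the godoc link); the address and numbers are illustrative assumptions, not the daemon's actual settings:

    package main

    import (
        "log"
        "time"

        "github.com/garyburd/redigo/redis"
    )

    // newPool builds a redigo Pool along the lines discussed above; the
    // numbers are illustrative, not the daemon's real settings.
    func newPool(addr string) *redis.Pool {
        return &redis.Pool{
            MaxIdle:     10,              // cap on idle connections kept in the pool
            IdleTimeout: 3 * time.Minute, // 0 would mean "never close an idle connection"
            Dial: func() (redis.Conn, error) {
                return redis.Dial("tcp", addr)
            },
            // TestOnBorrow pings a connection taken from the pool so a dead
            // socket is noticed before it is handed to the caller.
            TestOnBorrow: func(c redis.Conn, t time.Time) error {
                if time.Since(t) < time.Minute {
                    return nil
                }
                _, err := c.Do("PING")
                return err
            },
        }
    }

    func main() {
        pool := newPool("127.0.0.1:6379") // placeholder address
        conn := pool.Get()
        // Close on a pooled connection returns it to the pool rather than
        // closing the underlying socket.
        defer conn.Close()
        if _, err := conn.Do("PING"); err != nil {
            log.Fatal(err)
        }
    }

As noted in the conversation, Close() on a connection obtained from the pool returns it to the pool rather than closing the socket; only IdleTimeout (or MaxIdle pressure when connections are returned) actually closes the underlying connection.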
17:11:11 GMT I think the two are actually related
17:11:35 GMT referring back to why we're investigating
17:11:53 GMT which I think is key
17:12:25 GMT the golang daemon opens a redis connection, and subscribes to some channels
17:12:52 GMT it then loops waiting for events; if it sees an error it will re-init the subscription
17:13:17 GMT currently it's sat waiting for messages:
17:13:20 GMT # 0x65b73b github.com/multiplay/go/vendor/github.com/garyburd/redigo/redis.PubSubConn.Receive+0x4b /go/gopath/src/github.com/multiplay/go/vendor/github.com/garyburd/redigo/redis/pubsub.go:109
17:13:55 GMT so as far as it's concerned it has a valid connection which has subscriptions and is waiting for messages
17:14:18 GMT we have published to the relevant channels, the message wasn't received
17:14:39 GMT checking on the redis server, it doesn't see any subscriptions for said channels
17:14:48 GMT can you isolate one of these connections with a tcpdump?
17:15:05 GMT not so far
17:15:23 GMT the reason being too many established connections
17:16:07 GMT which leads to the other question: why so many established connections with no corresponding "client" as far as the server is concerned
17:16:16 GMT to ensure that the subscription is ACK'd and that both sides are still responding, it might be the most thorough approach
17:17:25 GMT redigo has a lot of buried logic with that connection pool that you will have to dig into
17:17:31 GMT the docs don't match the code 100%
17:18:02 GMT hmmm
17:18:07 GMT like the on-borrow logic to reuse a connection sends a PING to test the connection, but only if it has been a minute since a timestamp of some sort
17:18:30 GMT yer
17:18:53 GMT this plus concurrency makes it complicated to know the exact state of connections, and a simple mismatch in error handling can mean you think you're subscribed when really the PING timed out minutes ago and it never told you
17:19:18 GMT the thing is, in this case the pubsub connections should be 100% separate
17:19:30 GMT as they are never returned to the pool unless they break
17:20:17 GMT I can't find the exact problem for you, just point at some bits that are suspect in this particular client library
17:20:38 GMT thanks, it's appreciated
17:21:05 GMT I still think the weight of evidence points to an issue outside of redigo
17:21:58 GMT given IdleTimeout and MaxIdle are set
17:22:09 GMT so if you ported to plain go-redis, it would persist?
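A rough sketch of the subscribe-and-wait pattern described above, using redigo's PubSubConn; the address, channel name, and retry delay are placeholders, not the daemon's actual code:

    package main

    import (
        "log"
        "time"

        "github.com/garyburd/redigo/redis"
    )

    // subscribeLoop follows the pattern described above: dial, subscribe,
    // block in Receive, and re-initialise the subscription on any error.
    func subscribeLoop(addr string, channels ...interface{}) {
        for {
            conn, err := redis.Dial("tcp", addr)
            if err != nil {
                log.Printf("dial: %v", err)
                time.Sleep(time.Second)
                continue
            }
            psc := redis.PubSubConn{Conn: conn}
            if err := psc.Subscribe(channels...); err != nil {
                log.Printf("subscribe: %v", err)
                conn.Close()
                continue
            }
        recv:
            for {
                switch v := psc.Receive().(type) {
                case redis.Message:
                    log.Printf("message on %s: %s", v.Channel, v.Data)
                case redis.Subscription:
                    log.Printf("%s %s, subscription count now %d", v.Kind, v.Channel, v.Count)
                case error:
                    // Any error drops the connection and re-initialises the subscription.
                    log.Printf("receive: %v", v)
                    conn.Close()
                    break recv
                }
            }
        }
    }

    func main() {
        subscribeLoop("127.0.0.1:6379", "events") // placeholder address and channel
    }

A loop like this never writes to the socket while it sits in Receive, so if the peer vanishes without a FIN or RST the client can block indefinitely on a connection the server no longer counts among its subscribers.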
17:22:27 GMT and given we're seeing 10x the number of established connections compared to active clients
17:23:44 GMT and that is not just on the client side
17:23:58 GMT the discrepancy is also on the server side
17:24:10 GMT but look at it from my perspective: millions of connections per hour from Python and Go clients and every socket and connection is accounted for
17:24:25 GMT what's different is redigo's Pool
17:24:53 GMT just reading the code for it makes me think there are many states unaccounted for
17:25:28 GMT my personal inclination would be to port the code to use the simple and reliable https://godoc.org/github.com/go-redis/redis#PubSub to see if that quashes it
17:25:29 GMT yer, but there's also a connection apparently waiting for pubsub events where the server doesn't believe said client connection exists
17:25:38 GMT or at least it's removed the subscription for it
17:26:00 GMT I believe we know the cause of the issue
17:26:16 GMT we had VMware migrate the network endpoint
17:26:50 GMT the python stuff died badly
17:27:12 GMT go kept running fine, but we've only just noticed the lack of subscriptions
17:27:40 GMT you restarted Redis after the migration, right?
17:27:45 GMT nope
17:27:50 GMT now you tell me this
17:28:42 GMT I think my next step is going to be to bounce one of the docker containers and see if the discrepancy between connections and clients disappears on that node
17:29:44 GMT one pertinent question may be whether the redis server can remove a client without closing the connection?
17:30:05 GMT the redis server is outside of vmware
17:30:27 GMT and remained stable throughout the network switch
17:32:42 GMT yeah, maybe all the connections stayed open from pre-migration and can't actually transmit data
17:33:07 GMT sorry zpojqwfejwfhiunz, lots of potential bits at play, so just trying to get my head around the details
17:33:22 GMT yes I think that may well be the case
17:33:46 GMT you can CLIENT KILL everything and start over if you don't want to actually restart Redis (i.e. if persisting to disk doesn't make sense for your case)
17:33:57 GMT as it's very strange that we have so many more established TCP sessions than connected clients
17:34:04 GMT https://redis.io/commands/client-kill
17:34:21 GMT we can restart all the connected daemons
17:34:39 GMT which will be a nicer way to recycle things
17:35:14 GMT but trying to capture as much relevant info as possible before going there
17:52:20 GMT hmm this is going to take some hefty debugging
22:15:23 GMT can someone help me with system requirements recommendations when using redis
22:52:22 GMT hi, I have a question about persistence
22:52:30 GMT in regards to redis streams
22:53:42 GMT how much of the data is actually kept in memory?
23:54:38 GMT scriptor: everything
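A sketch of the CLIENT KILL route mentioned above, issued via redigo; the TYPE pubsub filter closes every pubsub-class connection on the server and returns the count, forcing subscribers to reconnect and re-subscribe. The address is a placeholder:

    package main

    import (
        "log"

        "github.com/garyburd/redigo/redis"
    )

    func main() {
        conn, err := redis.Dial("tcp", "127.0.0.1:6379") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // CLIENT KILL with a TYPE filter returns the number of connections closed.
        killed, err := redis.Int(conn.Do("CLIENT", "KILL", "TYPE", "pubsub"))
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("killed %d pubsub connections", killed)
    }

And a minimal sketch of the suggested port to go-redis's PubSub (the pre-v8 API from the godoc link above); again the address and channel name are placeholders:

    package main

    import (
        "log"

        "github.com/go-redis/redis"
    )

    func main() {
        client := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // placeholder
        defer client.Close()

        pubsub := client.Subscribe("events") // placeholder channel
        defer pubsub.Close()

        // Channel() delivers messages on a Go channel; go-redis re-subscribes
        // after connection errors rather than blocking forever on a dead socket.
        for msg := range pubsub.Channel() {
            log.Printf("message on %s: %s", msg.Channel, msg.Payload)
        }
    }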