09:18:32 GMT Good morning. Last Saturday we had a really weird problem with our redis cluster (6 Masters, 2 slaves per Master, version 3.0.7). At some point, one of the slaves of one master "stalled" and didn't return any results. Some minutes later, his master showed the same symptoms, which then broke the complete cluster. An hour later, the second slave also blocked every request, even info. Only after restarting the master, the cluster became healthy again.
09:19:19 GMT I have no real idea, what happened there - maybe we missed some important configuration… can someone help me?
09:40:38 GMT hard to say after the fact. unlikely it was a configuration issue.
09:40:55 GMT might have been useful to attach a debugger on hanging instances to see where they are stuck
09:41:21 GMT Sadly it was a production cluster and we had to bring it up quickly
09:41:44 GMT I just wonder how a stalled slave can bring a master down
09:41:53 GMT (and if there's any way to prevent that)
09:42:39 GMT maybe just hit a similar bug on both by accident?
09:42:52 GMT maybe, yes
09:43:24 GMT but then the second slave should take over
09:44:20 GMT you experienced buggy behaviour, all bets are off at that point^^
09:45:20 GMT so, the next time this happens I'll only reboot the master and keep one of the slaves for debugging
09:45:37 GMT or quickly force a coredump of the master
09:46:50 GMT we're running it in an immutable infrastructure, this will be complicated
09:46:57 GMT but I'll find a way
15:21:59 GMT badboy_: sorry for disturbing you again… I've just seen that my failed slaves now show no interest in connecting to its master again
15:22:26 GMT I already killed one of the slaves, removed every data file except the redis.conf and the nodes-list
15:22:36 GMT after start up it only shows "link down"
15:36:17 GMT Net problem I've found: redis is no longer able to do a bgsave
15:36:24 GMT s/net/next
15:39:30 GMT gcore it and inspect the coredump
15:43:42 GMT this will break the master, right?
15:45:22 GMT getting closer… from the slave: "29720:S 19 Dec 15:45:05.714 # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value."
15:58:31 GMT gcore will produce a coredump, but will not crash the program
15:58:43 GMT then, if it is already in some kind stuck it doesn't matter, right?
16:01:31 GMT it matters… the cluster itself is still working and serving
16:01:59 GMT I would love to have at least one slave for that master, so I can do some debugging in there
16:02:48 GMT the problem is: This thing is running in an immutable infrastructure, no gdb installed, no logfiles (my fault)
16:02:58 GMT I can't set a logfile during runtime
16:03:21 GMT I already tried to enable the diskless-sync, but this didn't help
16:08:45 GMT Hey, ive got a php redis driver and im trying to find what versions of redis it supports. Have there been significant changes in its API?
16:10:10 GMT max06: then you should get a gdb/gcore on there
16:10:21 GMT watmm: not since 2.0
16:33:26 GMT can I somehow create a slave from the masters aof-file?
17:12:59 GMT max06: no, you can only create a new instance that loads the file
17:16:32 GMT damn… okay, tomorrow morning I'll restart this misbehaving master and check if bgsave works again
17:17:13 GMT how much data is in your cluster?
17:17:30 GMT ~20GB in the masters
17:17:41 GMT (in total)
17:18:40 GMT this is running on 6 virtual machines, each with a 30G disk and 7.5G of memory, so each node handles between 3.5G and 4G of data
17:22:29 GMT I just tried to run a bgsave on the second failed slave, it also fails there - but on the first slave (with deleted data files) it worked well
17:22:59 GMT Wait a sec…. it spawns a second process - does it need the same amount of free memory?
17:40:10 GMT yes and no
17:40:44 GMT thanks to the wonders of Linux' CoW memory it probably won't unless you write the same amount at the same time into it
17:43:08 GMT <_Wise_> Hi *
17:44:10 GMT badboy_: but it helped me… once I enabled the overcommit-setting in the system, I was able to recreate the slaves :)
17:44:56 GMT it's noted in the faq's, but i haven't seen it today during my research… shame on me - thank you very much for your help!
17:45:15 GMT I'm going to add more masters tomorrow and redistribute the data a bit
17:45:21 GMT ah yeah, overcommit is necessary for that
17:45:26 GMT <_Wise_> I have a situation here with redis 3.0.5, I see zilions of EVALSHA in aof file. I assume that these are LUA scripts executed (I'l a sysadmin, not a dev... so I don't know what they are doing)
17:45:43 GMT max06: you should report that as an issue on the repo
17:45:54 GMT <_Wise_> Is there a way to avoid this to appear in aof file ?
17:46:01 GMT <_Wise_> these are read operations, right ?
17:46:12 GMT _Wise_: not necessarily
17:46:21 GMT <_Wise_> happens 10's of times per second
17:47:40 GMT <_Wise_> badboy_: so seeing these EVALSHA in aof file is not really something I can avoid, right ?
17:47:53 GMT maybe your devs actually know what they are doing? ;)
17:48:00 GMT nope
17:48:04 GMT badboy_: I can more imagine that this kernel setting was set up by our architecture guys… :D
17:50:30 GMT _Wise_: in redis 3.2 there is a mode to replicate the actual write commands instead of the script
17:51:06 GMT <_Wise_> badboy_: well, upgrading is not an option right now :/
17:52:09 GMT then why do you worry?
17:53:48 GMT <_Wise_> badboy_: ah yes, good point. I have a master-slave sync issue. I assume due to too high disk activity (master writing the aof file) and the slave losing connectivity to the master and triggering sync over and over again.
17:54:14 GMT <_Wise_> slave loses connection to master
17:54:25 GMT <_Wise_> asks for re-sync
17:54:48 GMT <_Wise_> loses connection again, and "goto 10" :/
17:55:33 GMT <_Wise_> "# I/O error trying to sync with MASTER: connection lost"
17:55:41 GMT writing the file should be less of a problem, even rewriting is done in another thread
17:55:55 GMT <_Wise_> yes
17:55:57 GMT you might try to change the config there
18:24:07 GMT <_Wise_> badboy_: maybe I should increase the repl-timeout value, right ?
18:24:20 GMT <_Wise_> it's 60 seconds by default
18:34:01 GMT <_Wise_> rebooted the slave, with empty database, and sync finally went through
18:55:15 GMT I'd like to bounce some ideas regarding doing Joins in redis. I've made a reddit post here: https://www.reddit.com/r/redis/comments/5iz0gi/joins_in_redis/
18:56:07 GMT Has this idea been explored already?
19:54:16 GMT Is there anyone here?