2006-10-10 17:22:37
Lee:
It can get even worse... Right before the longer nov* units were released, one of my client machines had a long run of quick and easy tuesday results. This caused BOINC to readjust its time-to-process estimate, and it thought it would be able to process a nov result in less than 2 minutes! It proceeded to download over 600 results to process!
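To put rough numbers on that, here's a minimal sketch, assuming the client just asks for enough results to fill a fixed work buffer based on its runtime estimate. The 20-hour buffer and the 60-minute "realistic" runtime are made-up figures for illustration, and the function name is hypothetical; this is not BOINC's actual work-fetch code.

```python
import math

def results_to_fill_buffer(buffer_hours, est_minutes_per_result):
    """Results needed to fill a work buffer at the estimated runtime per result."""
    return math.ceil(buffer_hours * 60 / est_minutes_per_result)

BUFFER_HOURS = 20  # assumed buffer size, not taken from any real client config

# With the skewed 2-minute estimate, the client wants hundreds of results.
print(results_to_fill_buffer(BUFFER_HOURS, 2))    # 600

# With a hypothetical 60-minute estimate, the same buffer needs far fewer.
print(results_to_fill_buffer(BUFFER_HOURS, 60))   # 20
```

Same buffer, 30 times the results, just because the estimate was off by a factor of 30.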
I wouldn't have thought this would be a problem, but when BOINC went to reschedule the CPU, it pegged the CPU for about a minute and a half trying to sort through all those results, and since it was basically locked up at that point, child processes weren't getting heartbeats. So an RCN result would start, fail to get a heartbeat, and exit. The exit of course forces another reschedule, another unit starts running and hits the same problem, and so on. Luckily I caught it in time and was able to shut it down and manually move a large chunk of the units out of the queue, so I could put them back in later and not lose any work.
Then there's the other feature I didn't know about: when BOINC connected to report the completed results, it redownloaded all the "lost results" from the server again! Nice, but it caused the problem all over again. So for the last week I've been nursing the stupid thing along, and now that I'm finally at a point where the results aren't overloading the client, the scheduler goes down. Just figures, doesn't it?
A side note about the CPU problems: I know it's not an optimal solution, but on that client I've just suspended network access. All it's got left is a couple hundred more RCNs to run and a CPDN, which has a couple thousand more hours left, so I can suspend there without worry, and the client won't burn cycles managing the upload queue. It's helped quite a bit. You may be able to get the same result by just suspending RCN, if it doesn't have any more work queued (just the uploads), but I haven't tried that. And of course, that's a pain if there are several clients to deal with. But just a thought!
-D