I'm upgrading the architecture of Refynr and need to choose a new primary database platform, specifically for large volumes of tweets, which I plan to store as the JSON returned by the Twitter API.
At the moment I'm using MySQL, with nearly 4 million tweets stored.
The new requirements will be:
- able to store up to 500 million tweets over the next 12 months
- also able to store 100 million Facebook posts
- and able to store 10 million RSS feed items
- clusterable for HA
- free, or nearly free
- established, proven platform (nothing super-new or Alpha)
- scalable for both high write and high read volumes:
  - there will be a Refynr API that feeds TweetDeck, HootSuite, and perhaps a few other clients that support Twitter API-compatible feeds
  - up to 100K writes per minute, as data is pulled in from Twitter, Facebook and RSS
- Nice to haves:
  - code libraries already built that connect via CFML or Java
  - fairly simple to set up (I'm new to the NoSQL world)
  - can easily import data from MySQL tables
  - cloud server-compatible: preferably Rackspace CloudServers, AWS, or GAE
  - runs on Linux
  - open-source
  - easy to test/run on Mac OS X
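
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. The item counts and the peak write rate come straight from the requirements list; the average item size (`ASSUMED_AVG_ITEM_BYTES`) is purely my assumption for illustration, not a measured figure, so treat the storage estimate as a placeholder until real data is sampled.

```python
# Figures taken from the requirements list above.
TWEETS = 500_000_000
FACEBOOK_POSTS = 100_000_000
RSS_ITEMS = 10_000_000
PEAK_WRITES_PER_MINUTE = 100_000

# ASSUMPTION (not measured): average stored JSON size per item, in bytes.
ASSUMED_AVG_ITEM_BYTES = 2_048


def total_items():
    """Total number of items the store has to hold."""
    return TWEETS + FACEBOOK_POSTS + RSS_ITEMS


def peak_writes_per_second():
    """Peak write rate converted from per-minute to per-second."""
    return PEAK_WRITES_PER_MINUTE / 60


def raw_storage_gib(avg_item_bytes=ASSUMED_AVG_ITEM_BYTES):
    """Raw payload size in GiB, before indexes, replication or overhead."""
    return total_items() * avg_item_bytes / 1024 ** 3


if __name__ == "__main__":
    print(f"total items:        {total_items():,}")
    print(f"peak writes/second: {peak_writes_per_second():.0f}")
    print(f"raw storage (GiB):  {raw_storage_gib():.0f}")
```

So the target is roughly 610 million items and a peak of about 1,667 writes per second. The storage figure scales linearly with whatever the real average item size turns out to be, and does not include indexes or replicas.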
I have my own ideas about what to use, but I don't want to sway your opinion, so I'm asking an open-ended question:
What would you use? What is your experience with your recommendation? And what are your reasons?
