(created by nicholas; last update by william, apr 2014)
Required infrastructure
- elasticsearch - 0.9 branch
- redis
- mongodb
- postgresql
Setting up local
There is a postgresql backup on the team's dropbox.
Create a local development database:
createdb oma_dev
Load the backup data:
gunzip -c marketfu_production.sql.gz | psql oma_dev
Make sure you have a config/database.yml that sets up the PG database. Example:
development:
  username: postgre_username
  database: oma_dev
  adapter: postgresql
test:
  username: postgre_username
  database: oma_test
  adapter: postgresql
Now copy config/application.yml.example to config/application.yml and make sure the names of the development and test Postgres databases match those in database.yml. At this point you should be able to start the application.
bundle exec thin start -R oma.ru -p 3000 -e development
Viewing the landing page
After running the app you'll notice that http://localhost:3000 redirects you to http://getoma.com, so you're no longer viewing the development app.
The problem is that the landing page expects a subdomain tied to a company. First you need to make sure that domains like omadev.localhost.dev (this is an example) also point to localhost.
Ubuntu 12.04: you can achieve this by editing the /etc/hosts file:
127.0.0.1 localhost
127.0.0.1 localhost.dev
127.0.0.1 omadev.localhost.dev
Note: this works only for explicitly declared domains; if you want a generic solution (*.localhost.dev), consider using dnsmasq.
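For the generic case, a single dnsmasq directive covers the wildcard (assuming your system resolver is pointed at dnsmasq; the config path may differ per distro):

```
# /etc/dnsmasq.conf — resolve localhost.dev and every subdomain to loopback
address=/localhost.dev/127.0.0.1
```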
Now you can "Sign Up" the omadev company and expect its landing page to be available at http://omadev.localhost.dev:3000. The signup form is at http://localhost.dev:3000/company_signup.
In the process you'll receive an activation email with an incorrect link. Replace the production domain (omadev.omaengine.com) with the local one (omadev.localhost.dev:3000) and paste the URL into the browser to activate the newly created account. This isn't right and should be fixed.
Running the backend processors
Crawler
The crawler is run with rake tasks. The entire crawler consists of several running processors:
- crawler
- link
- hydra
- attribute
- issues
- writer
In addition there's a token_tap task which provides rate limiting and there are commands to turn the crawler on and off.
The token tap needs to be running at all times
bundle exec rake opportunity_pipeline:token_tap
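The token tap is essentially a token-bucket rate limiter: it periodically drops tokens into a shared bucket, and each crawler request consumes one. A minimal in-memory sketch of that mechanism (the real task refills a bucket shared through redis; the class and method names here are illustrative, not the app's):

```ruby
# Token-bucket rate limiter sketch. The tap refills the bucket on a
# timer; consumers take one token per request and back off when empty.
class TokenTap
  def initialize(capacity:)
    @capacity = capacity
    @tokens = capacity
  end

  # Called periodically by the tap: refill the bucket up to capacity.
  def refill(count)
    @tokens = [@tokens + count, @capacity].min
  end

  # Called by a consumer before each request: take a token if available.
  def take
    return false if @tokens.zero?
    @tokens -= 1
    true
  end
end

tap = TokenTap.new(capacity: 2)
tap.take      # => true
tap.take      # => true
tap.take      # => false, bucket empty until the next refill
tap.refill(1)
tap.take      # => true
```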
Queue a domain for crawling
rails console:
rc = RedisCrawler::Console.new
rc.queue domain_id
When the crawler is running it will start immediately.
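Mechanically, queueing amounts to pushing the domain id onto a list in redis that the crawler task pops from. An in-memory sketch of that handoff (names are illustrative, not the app's):

```ruby
# Illustrative queue handoff: the console pushes a domain id onto a
# list (a redis list in the real app), the crawler pops ids off it.
class CrawlQueue
  def initialize
    @list = []          # stands in for a redis list
  end

  def queue(domain_id)  # what rc.queue domain_id amounts to
    @list.push(domain_id)
  end

  def pop               # the crawler's side
    @list.shift
  end
end

q = CrawlQueue.new
q.queue(42)
q.pop  # => 42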
Run the crawler
Crawler and link processor
The link and crawler need to run at the same time. The crawler task will fetch pages from the internet and store them in redis while the link processor will analyze those pages for new links to crawl and forward the crawled page to the hydra.
bundle exec rake redis_crawler:crawler
bundle exec rake redis_crawler:link
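The crawler/link handoff described above can be sketched as two threads sharing a queue (in the real app the shared store is redis; this shows only the shape of the pipeline):

```ruby
# Two-stage pipeline sketch: a "crawler" produces fetched pages, a
# "link processor" consumes them, extracts links, and forwards the
# page on toward the hydra.
pages    = Thread::Queue.new  # crawler -> link processor
to_hydra = Thread::Queue.new  # link processor -> hydra

crawler = Thread.new do
  # stands in for fetching pages from the internet
  [{ url: "http://example.com/", body: '<a href="/about">about</a>' }].each { |page| pages << page }
  pages << :done
end

link_processor = Thread.new do
  while (page = pages.pop) != :done
    page[:links] = page[:body].scan(/href="([^"]+)"/).flatten  # new links to crawl
    to_hydra << page                                           # forward to the hydra
  end
  to_hydra << :done
end

[crawler, link_processor].each(&:join)
result = to_hydra.pop
result[:links]  # => ["/about"]
```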
Hydra processor
The hydra processor checks links that are not part of the crawl domain for their status codes. When all the links are checked it pushes the page to the attribute processor.
bundle exec rake redis_crawler:hydra
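A sketch of the status check itself, using net/http HEAD requests to keep the fetch cheap (the "broken" threshold and helper names are assumptions; the real processor also tracks which page each link came from):

```ruby
require "net/http"
require "uri"

# Classify a link by its HTTP status code: 4xx/5xx count as broken.
def broken_status?(code)
  code.to_i >= 400
end

# Fetch just the headers of an external link. Network call — shown
# for shape only.
def link_status(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri).code
  end
end

broken_status?("200")  # => false
broken_status?("404")  # => true
```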
Attribute processor
The attribute processor analyzes the page and extracts attributes from it.
bundle exec rake redis_crawler:attribute
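The kind of extraction involved, sketched with plain regexes (the real attribute set is app-specific, and a real processor would use a proper HTML parser; these two attributes are just examples):

```ruby
# Extract a couple of page attributes from raw HTML (illustrative).
def extract_attributes(html)
  {
    title:            html[%r{<title>(.*?)</title>}m, 1],
    meta_description: html[/<meta\s+name="description"\s+content="([^"]*)"/, 1],
  }
end

html = '<title>Acme</title><meta name="description" content="Widgets">'
extract_attributes(html)
# => { title: "Acme", meta_description: "Widgets" }
```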
Issue processor
The issue processor analyzes the page for issues.
bundle exec rake redis_crawler:issue
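A sketch of issue detection over the extracted attributes (the actual issue rules belong to the app; these two checks are illustrative):

```ruby
# Flag page-level issues from extracted attributes (illustrative rules).
def page_issues(attrs)
  issues = []
  issues << :missing_title       if attrs[:title].to_s.empty?
  issues << :missing_description if attrs[:meta_description].to_s.empty?
  issues
end

page_issues(title: "Acme", meta_description: nil)  # => [:missing_description]
page_issues(title: nil, meta_description: nil)     # => [:missing_title, :missing_description]
```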
Writer
This processor writes out the page to mongodb, elasticsearch and S3.
bundle exec rake redis_crawler:writer
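The writer fans a finished page out to several stores. A sketch of that fan-out with pluggable sinks (the mongodb/elasticsearch/S3 clients are replaced here with in-memory stand-ins sharing one `store` method):

```ruby
# Fan a finished page out to every configured store.
class Writer
  def initialize(sinks)
    @sinks = sinks
  end

  def write(page)
    @sinks.each { |sink| sink.store(page) }
  end
end

# In-memory stand-in for a mongodb/elasticsearch/S3 client.
class MemorySink
  attr_reader :pages

  def initialize
    @pages = []
  end

  def store(page)
    @pages << page
  end
end

mongo, es, s3 = MemorySink.new, MemorySink.new, MemorySink.new
Writer.new([mongo, es, s3]).write(url: "http://example.com/")
es.pages.size  # => 1
```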
Opportunities
The opportunity pipeline is another important part of our infrastructure and pulls in data from social media and third party resources.
Sources:
1. facebook
2. twitter
3. bing news
4. forum
5. serps
6. profile providers for enrichment, e.g. the fullcontact api
Initiating opps retrieval for a project
rails console:
> include OpportunityPipeline::Console
> queue_project_keywords_for(project.id)
> queue_states
2013-03-13 15:56:51 UTC
Enrichment queue: 0
Facebook queue: 9
Twitter queue: 9
News queue: 9
Forum queue: 9
Serps queue: 9
Twitter Write Count: 0
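queue_states is just a snapshot of the pipeline's queue lengths. A sketch of that report against an in-memory stand-in for redis (queue names follow the output above; the helper name is illustrative):

```ruby
# Build a timestamped snapshot of queue lengths (redis LLEN per queue
# in the real console helper; a Hash of arrays stands in here).
def queue_states(queues, now: Time.now.utc)
  lines = [now.strftime("%Y-%m-%d %H:%M:%S UTC")]
  queues.each { |name, items| lines << "#{name} queue: #{items.size}" }
  lines
end

report = queue_states({ "Facebook" => [1, 2], "Twitter" => [] },
                      now: Time.utc(2013, 3, 13, 15, 56, 51))
report
# => ["2013-03-13 15:56:51 UTC", "Facebook queue: 2", "Twitter queue: 0"]
```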
Retrieving news opps
News opportunities are retrieved by running a news retriever and a news processor.
bundle exec rake opportunity_pipeline:news_retriever
bundle exec rake opportunity_pipeline:news
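Each source follows the same retriever/processor split: the retriever pulls raw results from the external API onto a queue, and the processor turns each result into an opportunity. A sketch of the shape shared by the news/facebook/twitter/forum pairs (names and record fields are illustrative):

```ruby
# Retriever/processor split shared by the source pipelines: the
# retriever fetches raw items, the processor turns each into an
# opportunity record.
raw = Thread::Queue.new

retriever = Thread.new do
  # stands in for one external API call per queued keyword
  [{ source: "news", headline: "Acme raises widgets" }].each { |item| raw << item }
  raw << :done
end

opportunities = []
processor = Thread.new do
  while (item = raw.pop) != :done
    opportunities << { source: item[:source], text: item[:headline] }
  end
end

[retriever, processor].each(&:join)
opportunities.size  # => 1
```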
Retrieving facebook opps
Facebook opportunities are retrieved by running a facebook retriever and a facebook processor.
bundle exec rake opportunity_pipeline:facebook_retriever
bundle exec rake opportunity_pipeline:facebook
Retrieving twitter opps
Twitter opportunities are retrieved by running a twitter retriever and a twitter processor.
bundle exec rake opportunity_pipeline:twitter_retriever
bundle exec rake opportunity_pipeline:twitter
Retrieving forum opps
Forum opportunities are keyword mentions on forums; they are retrieved by running a forum retriever and a forum processor.
bundle exec rake opportunity_pipeline:forum_retriever
bundle exec rake opportunity_pipeline:forum
Retrieving serps opps
Serps are loaded from the serps table and then treated as a source for opportunities: each one generates a mention and a potential lead, which are loaded up for enrichment.
bundle exec rake opportunity_pipeline:forum_retriever
bundle exec rake opportunity_pipeline:forum
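A sketch of the mention/lead generation described above, from a single serp row (the field names are assumptions, not the actual schema):

```ruby
require "uri"

# From one serp row, generate a mention and a potential lead and mark
# both for enrichment (field names are illustrative).
def opportunities_from_serp(serp)
  mention = { type: :mention, url: serp[:url], keyword: serp[:keyword] }
  lead    = { type: :lead,    domain: URI(serp[:url]).host }
  [mention, lead].each { |opp| opp[:enrich] = true }
end

opps = opportunities_from_serp(url: "http://blog.example.com/post", keyword: "widgets")
opps.map { |o| o[:type] }  # => [:mention, :lead]
```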
Enrichment
During the opportunity retrieval we've identified entities: potentially contactable items such as websites or social media users. The enrichment phase consists of digging in and trying to find out more about them and, hopefully, isolating contact details.
bundle exec rake opportunity_pipeline:enrichment
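A sketch of the enrichment step: merge whatever a profile provider returns into the entity, preferring data we already have and flagging any contact details found (the provider response shape is hypothetical, not fullcontact's actual API):

```ruby
# Merge a profile provider's response into an entity; keep existing
# values where present and flag the entity as contactable if an email
# was found. The response shape here is hypothetical.
def enrich(entity, provider_response)
  enriched = entity.merge(provider_response) { |_key, ours, theirs| ours || theirs }
  enriched[:contactable] = !enriched[:email].nil?
  enriched
end

entity = { handle: "@acme", email: nil }
enrich(entity, email: "hi@acme.com", location: "Berlin")
# => { handle: "@acme", email: "hi@acme.com", location: "Berlin", contactable: true }
```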
Some more processing
The mentions and leads section still doesn't work, as we still need to write to elasticsearch and do some twitter processing. Run these resque workers:
resqueworker: bundle exec rake resque:work QUEUE=twitter_processor
mention_resqueworker: bundle exec rake resque:work QUEUE=opp_mention_writer_queue