Wednesday, March 21, 2012

Notes from DrupalCon - Keeping the lights on (operations and monitoring best practices)

The following are my notes from Keeping the lights on - operations and monitoring best practices on Wednesday, March 21st, 2012 at DrupalCon Denver.
“Measurement is the link between mathematics and science” - Brian Ellis, Cambridge, 1968

Primary topics

  • Platform management, monitoring, and measurement
  • Security testing and monitoring
    • Monitoring - mean time to recovery is a key metric (how long does it take to fix)
  • Ongoing operational security
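
As an illustration of the mean-time-to-recovery metric mentioned above, here is a minimal sketch that averages the time between detection and resolution. The incident record shape (pairs of detected/resolved timestamps) and the function name are assumptions made for the example, not something shown in the session.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs.

    The tuple layout is a hypothetical record format chosen for this sketch.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Two hypothetical incidents: one fixed in 30 minutes, one in 90 minutes.
example_incidents = [
    (datetime(2012, 3, 1, 10, 0), datetime(2012, 3, 1, 10, 30)),
    (datetime(2012, 3, 8, 14, 0), datetime(2012, 3, 8, 15, 30)),
]
mttr = mean_time_to_recovery(example_incidents)
```

Tracking this value as a trend over time, rather than per incident, is what makes it useful as an operations metric.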

Essential Monitoring Features

  • Real-time AND trend monitoring
    • Infrastructure based
  • Custom plugin system
    • Avoid proprietary languages to ensure anyone can contribute
  • Runs your functional tests
  • Active AND passive monitoring
    • Push alerts
  • Log analysis
  • Escalation
    • Quality of life - levels, rotations
  • Remote command/“job” execution

Functional tests

  • Use Selenium

Business metrics

  • PageRank
  • Things that are relative to the business
  • Number of users

Technical monitoring
  • APC tool
  • Service state
  • Cron - execute from remote monitoring system like Nagios
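
A rough sketch of what a service-state check run by a system like Nagios could look like. Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output. The thresholds, function names, and the choice to treat 4xx as a warning are assumptions for illustration only; the URL is supplied by the caller.

```python
import time
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status_code, elapsed_seconds, warn_after=2.0):
    """Map an HTTP status and response time to (exit_code, message).

    warn_after is a hypothetical slow-response threshold in seconds.
    """
    if status_code is None:
        return CRITICAL, "CRITICAL - no HTTP response"
    if status_code >= 500:
        return CRITICAL, "CRITICAL - HTTP %d" % status_code
    if status_code >= 400:
        return WARNING, "WARNING - HTTP %d" % status_code
    if elapsed_seconds > warn_after:
        return WARNING, "WARNING - slow response (%.2fs)" % elapsed_seconds
    return OK, "OK - HTTP %d in %.2fs" % (status_code, elapsed_seconds)


def check_url(url, timeout=10.0):
    """Fetch the URL once and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, timeout, ...
    return evaluate(status, time.monotonic() - start)


def main(argv):
    """Return the exit code a wrapper script would pass to sys.exit()."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_site.py URL")
        return UNKNOWN
    code, message = check_url(argv[1])
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main(sys.argv))`. Pointing the same check at a site's cron URL is one way to run cron from the monitoring host and be alerted when it fails.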

Nagios Module

Job Automation

  • Jenkins is the de facto standard for continuous integration and deployment
  • Codify and script all deployment activities


  • Turn on syslog logging - instead of database, write to a text file
  • Centralized off-server
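
To make the idea of shipping logs to a central host concrete, here is a small sketch using Python's standard `logging.handlers.SysLogHandler`. It is not Drupal's syslog module, only an illustration of the same pattern. The logger name, facility, and default port are assumptions, and the host is supplied by the caller.

```python
import logging
import logging.handlers


def build_central_logger(host, port=514, name="drupal-site"):
    """Return a logger that sends records to a syslog daemon over UDP.

    The name "drupal-site" and the LOCAL0 facility are hypothetical choices.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    # Prefix each message with the logger name so the central host can filter.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling `build_central_logger("127.0.0.1").warning("cron run failed")` would send one record to a local syslog daemon. In practice the host would be the off-server log collector, so the logs survive even if the web server is compromised or lost.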

Monitoring Overview

  • Ping or HTTP result code alert monitoring || Live user story testing and trend analysis
  • Crontabs and poormanscron || centralized cron management
  • Logging to database only || Syslog logging to central host
  • Logging in to see Drupal errors and available updates || Centralized Drupal monitoring
  • Offsite backups || Off-cloud backups

Book recommendation

  • The Visible Ops Handbook

Security Testing and Monitoring

  • Tools and services to detect and respond to vulnerabilities and threats.


Finding the problem


  • Mitigate, fix, alert
  • Having a response plan before incidents occur


  • Weaknesses


  • Ways to attack, whether or not they are successful

Vulnerabilities (OWASP Top 10)

  1. Injection
  2. XSS - biggest problem in Drupal
  3. Broken auth/session - using core? OK
  4. Insecure direct object reference - managing access
  5. CSRF
  6. Misconfiguration
  7. Insecure cryptographic storage - site specific, SSH, using a VPN to encrypt traffic
    1. Exception - password hash, encrypted information within site and database (encryption module)
  8. Failure to restrict URL access
  9. Insufficient transport layer protection - https
  10. Unvalidated redirects and forwards

Detecting Vulnerabilities

  • Automated code reviews
    • Static: Coder module, Secure Code Review module, Acquia
    • Dynamic: Not common
  • Automated penetration testing
    • Generic tools: Grendelscan (open source), Fortify, Rational
    • Drupal Tools: Acquia
  • Manual code reviews
    • Example to catch: db_query("DELETE FROM {users} WHERE name = '$name'"); (user input interpolated directly into SQL)
  • Manual penetration testing
    • Be an intelligent robot
    • Vuln.module (NEEDS PORT TO DRUPAL 7), Firefox: Tamper Data
Security review module
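
The manual code review item above is about queries that interpolate user input into SQL. The sketch below reproduces that pattern against an in-memory SQLite database and contrasts it with a parameterized query. It uses Python and `sqlite3` purely for illustration, not Drupal's database API; the table layout and function names are assumptions.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def delete_user_unsafe(conn, name):
    # VULNERABLE on purpose: the value is pasted into the SQL string,
    # so quotes inside `name` change the meaning of the query.
    conn.execute("DELETE FROM users WHERE name = '%s'" % name)


def delete_user_safe(conn, name):
    # The placeholder keeps the value out of the SQL text entirely.
    conn.execute("DELETE FROM users WHERE name = ?", (name,))


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# A classic injection payload: it matches no real user name, but the
# trailing OR clause makes the unsafe WHERE condition true for every row.
payload = "nobody' OR '1'='1"

unsafe_db = make_db()
delete_user_unsafe(unsafe_db, payload)
rows_after_unsafe = count_users(unsafe_db)

safe_db = make_db()
delete_user_safe(safe_db, payload)
rows_after_safe = count_users(safe_db)
```

With the unsafe version the payload empties the table, while the parameterized version deletes nothing because no user has that literal name. In a review, any query built by string concatenation or interpolation deserves this kind of scrutiny.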

Responding to Vulnerabilities

Custom code:

  1. Fix it
  2. Test it
  3. Deploy it
  4. Contact customers (?)

Contributed Code

  1. 4 steps above
  2. Work out a simple, repeatable test case
  3. Report the issue to the Drupal Security Team
  4. Compare to
  5. Work with the Team and maintainer to get a fix
  6. something else???

Detecting threats

Responding to threats

  • Spam
    • Mollom, Akismet
    • Spam, flag_abuse
  • Defacement
    • Revert to good copies from version control
    • Overwrite with new versions
    • Node revisions, db backup
  • Code injection
    • Keep code safe
    • Proactively block attackers at the firewall
  • Brute force password
    • login_security module
    • Included in Drupal 7 core
  • Help with everything: httpBL
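
For the brute-force item above, this is a minimal sketch of the general idea behind login throttling: count recent failures per key (a user name or an IP address) and block once a limit is reached inside a time window. It is not the login_security module's or Drupal core's actual implementation; the class name and the default limits are assumptions.

```python
from collections import defaultdict, deque


class LoginThrottle:
    """Sliding-window counter of failed logins per key."""

    def __init__(self, max_failures=5, window_seconds=3600):
        # Both defaults are hypothetical values chosen for the example.
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)

    def record_failure(self, key, now):
        """Remember a failed attempt at timestamp `now` (seconds)."""
        self._failures[key].append(now)

    def is_blocked(self, key, now):
        """True if `key` has reached the limit within the window."""
        attempts = self._failures[key]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        return len(attempts) >= self.max_failures


throttle = LoginThrottle(max_failures=5, window_seconds=3600)
for second in range(5):
    throttle.record_failure("203.0.113.7", now=second)
blocked_now = throttle.is_blocked("203.0.113.7", now=10)
blocked_later = throttle.is_blocked("203.0.113.7", now=3700)
```

Timestamps are passed in explicitly so the behavior is easy to test. A real deployment would also need shared storage so the counts survive across web server processes.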

Site monitoring

  • Internal/Free
    • Views
    • Mailmon - brand new
    • Quant - charting
    • Report - charting
    • Chart (system_charts)
  • External/Paid
    • Acquia network - ~$350/year, includes library, support
      • Acquia Insight
    • Droptor - $24/month/site, monitoring only
    • - unknown pricing

Three keys to ongoing operational security

  • Vigilance
  • Strong Chain
  • Incident Handling

What are the things that we need to do on an ongoing basis after launch?

  • Maintain eternal vigilance
  • Automate as much as possible
    • Avoiding human error - often “I was too busy to get to it”
  • Conduct periodic audits
  • Never sleep

Periodic Audit Program

Avoiding weak links in the chain

  • Education
  • Training
  • Awareness


  • PCI DSS requires patching of all critical infrastructure within 30 days
  • What:
    • Linux or other underlying OS
    • Firewall infrastructure
    • Switches
    • Wireless Access Points
    • … more
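
One way to keep the 30-day patching window visible is a periodic report of patches that were released more than 30 days ago and are still not applied. The sketch below assumes a hypothetical inventory format (component name, release date, applied flag); how that inventory is collected is outside its scope.

```python
from datetime import date, timedelta

# The 30-day window comes from the PCI DSS note above.
PATCH_WINDOW = timedelta(days=30)


def overdue_patches(patches, today):
    """Return component names whose unapplied patch is older than the window.

    Each entry is a dict with 'component', 'released' (date) and 'applied'
    (bool); this record shape is an assumption made for the example.
    """
    return [
        patch["component"]
        for patch in patches
        if not patch["applied"] and today - patch["released"] > PATCH_WINDOW
    ]


# Hypothetical inventory entries.
inventory = [
    {"component": "os-kernel", "released": date(2012, 1, 10), "applied": False},
    {"component": "firewall", "released": date(2012, 3, 1), "applied": False},
    {"component": "switch-firmware", "released": date(2012, 1, 5), "applied": True},
]
overdue = overdue_patches(inventory, today=date(2012, 3, 21))
```

Run from a scheduled job, a non-empty result could be raised as an alert in the same monitoring system used for everything else.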

Incident Management (needs to be written)

  1. Initial Response
  2. Notification and Escalation
    1. Smallest possible group for as long as possible, then figure out communication
  3. Response Strategy
    1. Do we need to update? Notify users?

One important take-away

  • Don’t use the same password on multiple sites you administer (PlayStation Network)

Secure Site Admin Pledge

  • I pledge to take the following steps to be a responsible Drupal site administrator:
  • I have set a unique, strong password for any accounts with administrative privileges, and I do not share passwords across sites
  • I use multi-factor authentication (e.g., ssh keys) for OS-level access and have password-only access disabled on my systems.
  • I have and execute a patching plan that includes the OS, web server, and Drupal layers (including core, modules, and custom code)
  • I have and execute at least a minimalist periodic audit plan
  • I am aware of and comply with applicable information security requirements for the data that my site handles (HIPAA, PCI DSS, etc.)
  • I monitor vulnerability announcement mailing lists for the technologies I use on my site
  • I monitor my system regularly such that I know how it behaves under normal conditions
  • I have a documented incident handling plan that I am familiar with and can use in an emergency
  • I take responsibility for ensuring that any custom code is developed according to secure coding best practices and is evaluated before being put into production
  • I will be eternally vigilant and investigate any unusual/suspicious site behavior
  • I have a process in place to ensure non-production sites are appropriately protected from external access/crawling
  • I am an advocate for practical information security practices and avoid “Security theater” showmanship

Thank You!

Please get in touch to chat about these topics:
