(My) CISSP Notes – Business Continuity and Disaster Recovery Planning

Note: This notes were made using the following books: “CISPP Study Guide” and “CISSP for dummies”.

Business Continuity and Disaster Recovery Planning is an organization’s last line of defense: when all other controls have failed, BCP/DRP is the final control that may prevent drastic events such as injury, loss of life, or failure of an organization.

An additional benefit of BCP/DRP is that an organization that forms a business continuity team, and conducts a thorough BCP/DRP process, is forced to view the organization’s critical processes and assets in a different, often clarifying, light. Critical assets must be identified and key business processes understood. Standards are employed. Risk analysis conducted during a BCP/DRP plan can lead to immediate mitigating steps.

BCP

The overarching goal of a BCP is for ensuring that the business will continue to operate before, throughout, and after a disaster event is experienced. The focus of a BCP is on the business as a whole, and ensuring that those critical services that the business provides or critical functions that the business regularly performs can still be carried out both in the wake of a disruption as well as after the disruption has been weathered.

Business Continuity Planning provides a long-term strategy for ensuring that continued successful operation of an organization in spite of inevitable disruptive events and disasters.

BCP deals with keeping business operations running, perhaps in other location or using different tools and processes, after the disaster has struck.

DRP

The DRP provides a short-term plan for dealing with specific IT-oriented disruptions. The DRP focuses on efficiently attempting to mitigate the impact of a disaster and the immediate response and recovery of critical IT systems in the face of a significant disruptive event.The DRP does not focus on long-term business impact in the same fashion that a BCP does. DRP deals with restoring normal business operations after the disaster takes place.

These two plans, which have different scopes, are intertwined. The Disaster Recovery Plan serves as a subset of the overall Business Continuity Plan, because a BCP would be doomed to fail if it did not contain a tactical method for immediately dealing with disruption of information systems.

Defining disastrous events

The three common ways of categorizing the causes for disasters are as to whether the threat agent is natural, human, or environmental in nature.

Natural disasters – fires and explosions, earthquakes, storms, floods, hurricanes, tornadoes, landslices, tsunamis, pandemics
Human disasters (intentional or unintentional threat) – accidents, crime and mischief, war and terrorism, cyber attacks/cyber warfare, civil disturbance
Environmental disasters – this class of threat includes items such as power issues (blackout, brownout, surge, spike), system component or other equipment failures, application or software flaws.

Though errors and omissions are the most common threat faced by an organization, they also represent the type of threat that can be most easily avoided.

The safety of an organization’s personnel should be guaranteed even at the expense of efficient or even successful restoration of operations or recovery of data.

Recovering from a disaster

The general process of disaster recovery involves responding to the disruption; activation of the recovery team; ongoing tactical communication of the status of disaster and its associated recovery; further assessment of the damage caused by the disruptive event; and recovery of critical assets and processes in a manner consistent with the extent of the disaster.

Respond – In order to begin the disaster recovery process, there must be an initial response that begins the process of assessing the damage. The initial assessment will determine if the event in question constitutes a disaster.
Activate Team – If during the initial response to a disruptive event a disaster is declared, then the team that will be responsible for recovery needs to be activated.
Communicate – After the successful activation of the disaster recovery team, it is likely that many individuals will be working in parallel on different aspects of the overall recovery process. In addition to communication of internal status regarding the recovery activities, the organization must be prepared to provide external communications, which involves disseminating details regarding the organization’s recovery status with the public.
Assess – A more detailed and thorough assessment will be done by the, now activated, disaster recovery team. The team will proceed to assess the extent of damage to determine the proper steps to ensure the organization’s mission is fulfilled.
Reconstitution – The primary goal of the reconstitution phase is to successfully recover critical business operations either at primary or secondary site.

BCP/DRP Project elements

A BCP project typically has four components: Scope determination, business impact assessment, identify preventive controls and implementation.

BCP Scope

The success and effectiveness of a BCP depends greatly on whether senior management and the project team properly defines the scope. Specific questions will need to be asked of the BCP/DRP planning team like “What is in and out of scope of this plan”.

Business impact assessment (BIA)

The BIA describes the impact that a disaster is expected to have on business operations. Any BIA should contains the following tasks:

Perform an vulnerability Assessment – The goal of the vulnerability assessment is to determine the impact of the loss of a critical business function.
Perform a critically assessmentThe team members need to estimate the duration of a disaster event to effectively prepare the critically assessment. Project team members needs to consider the impact of a disruption based on the length of time that a disasters impairs critical business functions.
Determine the Maximum Tolerable DowntimeThe primary goal of the BIA is to determine the Maximum Tolerable Downtime (MTD), also known as Maximum Tolerable Period Of Disruption (MTPD) for a specific IT asset. MTD is the maximum period of time that a critical business function can be inoperative before the company incurs significant and log-lasting damage.
Establish recovery targetsThese targets represent the period of time from the start of a disaster until critical processes have resumes functioning. Two primary recovery targets are established for each business process: Recovery Time Objective (RTO) and Recovery Point Objective(RPO).RTO is the maximum period of time in which a business prices must be restored after a disaster. The RTO is also called the system recovery time.
RPO is the maximum period of time in which data might be lost if a disaster strikes. The RPO represents the maximum acceptable amount of data/work loss for a given process because of a disaster or disruptive event.
Determine ressource requirements – This portion of the BIA is a listing of the resources that an organization needs in order to continue operating each critical business function.

Identify preventive controls

Preventive controls prevent disruptive events from having an impact. The BIA will identify some risks which might be mitigated immediately. Once the BIA is complete, the BCP team knows the Maximum Tolerable Downtime. This metric, as well as others including the Recovery Point Objective and Recovery Time Objective, are used to determine the recovery strategy.

Once an organization has determined its maximum tolerable downtime, the choice of recovery options can be determined. For example, a 10-day MTD indicates that a cold site may be a reasonable option. An MTD of a few hours indicates that a redundant site or hot site is a potential option.

A redundant site is an exact production duplicate of a system that has the capability to seamlessly operate all necessary IT operations without loss of services to the end user of the system.
A hot site is a location that an organization may relocate to following a major disruption or disaster.It is important to note the difference between a hot and redundant site. Hot sites can quickly recover critical IT functionality; it may even be measured in minutes instead of hours. However, a redundant site will appear as operating normally to the end user no matter what the state of operations is for the IT program.
A warm sitehas some aspects of a hot site, for example, readily-accessible hardware and connectivity, but it will have to rely upon backup data in order to reconstitute a system after a disruption.An organizations will have to be able to withstand an MTD of at least 1-3 days in order to consider a warm site solution.

A cold site is the least expensive recovery solution to implement. It does not include backup copies of data, nor does it contain any immediately available hardware.
Reciprocal agreements are a bi-directional agreement between two organizations in which one organization promises another organization that it can move in and share space if it experiences a disaster.
Mobile sites are “datacenters on wheels”: towable trailers that contain racks of computer equipment.

As discussed previously, the Business Continuity Plan is an umbrella plan that contains others plans. In addition to the Disaster recovery plan, other plans include the Continuity of Operations Plan (COOP), the Business Resumption/Recovery Plan (BRP), Continuity of Support Plan, Cyber Incident Response Plan, Occupant Emergency Plan (OEP), and the Crisis Management Plan (CMP).

The Business Recovery Plan (also known as the Business Resumption Plan) details the steps required to restore normal business operations after recovering from a disruptive event. This may include switching operations from an alternate site back to a (repaired) primary site.

The Continuity of Support Plan focuses narrowly on support of specific IT systems and applications. It is also called the IT Contingency Plan, emphasizing IT over general business support.

The Cyber Incident Response Plan is designed to respond to disruptive cyber events, including network-based attacks, worms, computer viruses, Trojan horses.

The Occupant Emergency Plan(OEP) provides the “response procedures for occupants of a facility in the event of a situation posing a potential threat to the health and safety of personnel, the environment, or property”. This plan is facilities-focused, as opposed to business or IT-focused.

The Crisis Management Plan(CMP) is designed to provide effective coordination among the managers of the organization in the event of an emergency or disruptive event. A key tool leveraged for staff communication by the Crisis Communications Plan is the Call Tree, which is used to quickly communicate news throughout an organization without overburdening any specific person. The call tree works by assigning each employee a small number of other employees they are responsible for calling in an emergency event.

Implementation

The implementation phase consists in testing, training and awareness and continued maintenance.

In order to ensure that a Disaster Recovery Plan represents a viable plan for recovery, thorough testing is needed. There are different types of testing:

The DRP Review is the most basic form of initial DRP testing, and is focused on simply reading the DRP in its entirety to ensure completeness of coverage.
Checklist(also known as consistency) testing lists all necessary components required for successful recovery, and ensures that they are, or will be, readily available should a disaster occur.Another test that is commonly completed at the same time as the checklist test is that of the structured walkthrough, which is also often referred to as a tabletop exercise.
A simulation test, also called a walkthrough drill (not to be confused with the discussion-based structured walkthrough), goes beyond talking about the process and actually has teams to carry out the recovery process. A pretend disaster is simulated to which the team must respond as they are directed to by the DRP.
Another type of DRP test is that of parallel processing. This type of test is common in environments where transactional data is a key component of the critical business processing. Typically, this test will involve recovery of critical processing components at an alternate computing facility, and then restore data from a previous backup. Note that regular production systems are not interrupted.
Arguably, the most high fidelity of all DRP tests involves business interruption testing. However, this type of test can actually be the cause of a disaster, so extreme caution should be exercised before attempting an actual interruption test.Once the initial BCP/DRP plan is completed, tested, trained, and implemented, it must be kept up to date.BCP/DRP plans must keep pace with all critical business and IT changes.Business continuity and disaster recovery planning are a business’ last line of defense against failure. If other controls have failed, BCP/DRP is the final control. If it fails, the business may fail.
A handful of specific frameworks are worth discussing, including NIST SP 800-34, ISO/IEC-27031, and BCI.

Adventures in the programming jungle

Adrian Citu's Technical Blog