Blackboard News from CUNY (03/12/2009)

On Behalf Of Brian Cohen

Dear Colleagues,

Below you will find a very detailed and technical accounting of the current problem with accessing the Blackboard system. Overall, the CIS team made progress today in restoring the system files. The restore/recovery process has been running for most of the day and is expected to continue throughout the night. We will continue this effort until it is completed and will, as part of our plan, use the morning hours to validate the restored files and test the Blackboard system. Testing will help us ensure the stability of the environment and confirm normal operations. We plan to reach out to some faculty members who have been active in our Blackboard Advisory Committee to assist in this testing effort.

As of this writing, we have not established a specific time by which this work will be completed. However, once it is completed, we will restore access to the production environment. We greatly appreciate the University's understanding of the problem and of the efforts currently underway to return key services to our students.

In addition to these efforts, CIS staff have also been working to create both a short- and long-term Disaster Recovery (DR) plan that includes a third-party off-site Disaster Recovery facility with sufficient capacity to deliver Blackboard 8 services when failures like this occur. The University is committed to completing this DR effort with the greatest urgency.

I will continue to keep you informed as new information becomes available.

Thank you.

Associate Vice Chancellor and University CIO

On Tuesday, March 10th, between 6 and 8 AM, the systems group was conducting scheduled maintenance on the Blackboard 8 production environment. During this maintenance one of the principal activities was to move Oracle resources from one of the Sun Enterprise 25K frames to the other. This would give the Oracle databases additional memory and processing power.
CIS had conducted this exercise on numerous occasions. During the process of moving these resources, a problem was encountered with the Sun Cluster services. Typically, this migration performs an orderly shutdown of the Oracle cluster resources, including file systems, on one frame and starts them up in an orderly fashion on the other. It was during the startup process that a problem occurred between the start of the Oracle cluster resources and the underlying Sun ZFS file system. The cluster was unable to properly connect to the Oracle executable file system. This caused the system to hang and ultimately fail. The cause of this problem appears to lie in the interaction between the Sun Cluster software and the Sun ZFS software.

Once the system failed, it was necessary to restart the services manually to try to reconnect to the proper file system and bring the services back online. When this was attempted, the system reported a problem with the ORACLEDATA and REDO ZFS file systems. Closer examination revealed that the file systems were offline and unavailable to the cluster. The cluster was continuing to wait for the resources to be made available, but the file systems needed to be brought online for this to happen. As a result, the cluster assumed that the file system wasn't available on the node where it was currently running. It initiated a restart and tried to move all of the resources back to the original location on the other frame. As it tried to start the Oracle services again on the original frame, it again encountered an offline file system. It again initiated a restart and tried to move the resources to the other cluster node. This "ping-pong" effect caused both systems to reboot several times and took all SAN-based storage offline.

A decision was made to take a two-track approach to resolving the outage. The first was to try to recover the ZFS file systems through the use of various system tools.
The second was to create additional SAN storage to recover the data from tape.

The systems group worked closely with Sun Microsystems support and the Sun account executive to escalate the case within the Sun support organization, focusing on Sun Cluster and ZFS recovery. Those resources began working with us by 9 AM on Tuesday. These efforts lasted throughout the day and into the evening as the exact circumstances of the situation were evaluated and the base-level issues were clarified. Both 25K nodes were rebooted into a non-clustered state to focus on the ZFS-specific problems with the environment. These efforts went into late Tuesday evening without concrete results. The tape recovery was delayed for the day due to the frequent need to reboot the systems to try to recover the file systems.

The effort resumed Wednesday morning at 6 AM with a conference call with Sun support to review progress and move forward with the recovery. The systems were now stable enough to format the additional disk storage for use in recovering the ORACLEDATA file system. Once the new file systems were made available, the DBA group was tasked with configuring the disks for a tape recovery. The tape recovery is still in process at this time. However, there are possible ramifications with this approach, so CIS has decided to continue its efforts to restore the original Oracle data sets as a first option.

The systems group continues to work with Sun support to recover the ZFS file systems. At around 1 PM they were able to bring back the CONTENT file system. Once the file system was back online, a verification utility was run to make sure it was in usable shape. With the Sun account executive on site, additional resources were brought in to look at the ORACLEDATA file system. Progress is being made in determining a recovery path for that file system; however, it remains unusable at this time.
Staff are once again scheduled to be on site working on this into the evening and will also be here at 6 AM tomorrow.

In order to avoid this circumstance in the future, additional space and facilities will need to be put in place for disaster recovery and business continuity. This will require equipment, software, and networking configurations capable of running the Blackboard system at sufficient capacity to run all of the classes for the duration of any outage in the primary data center. The University is committed to this effort, and staff are actively working to procure a site and architect the proposed facility.

Lester Jacobs
Designed and Developed by Office of Academic Information Technologies. Brooklyn College Library, All Rights Reserved © 1999-2008
