Update
3:55 p.m.
IT is still recovering data for usernames that begin with T through Z. This data will be made available at /share/recovery/data as it is retrieved.
Service has been restored to the NIC Cluster, and users should now be able to sign in via SSH to nic.mst.edu. A massive storage system hardware failure caused this outage, and IT is working to recover as much customer data as possible.
Access for NIC Cluster users
Because of the nature of the hardware failure, all users will have "blank slate" directories. These directories have a default quota of 1 GB, but additional storage space can be requested by contacting the IT Help Desk at 573-341-HELP.
Approximately 75 percent of user data was recovered to a temporary location. IT is unable to hand-check individual files for correctness, so customers should check their recovered data and copy what they need back to their home directories. The recovered data is available in read-only form at the directory path /share/recovery/users/[username]. This data will be retained for eight weeks (until June 10) while users copy any needed data back to their home directories.
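For users who would rather script the copy than do it by hand, the following is a minimal Python sketch, not an IT-provided tool. It totals the size of a recovered directory, compares it with the space already used in the home directory against the default 1 GB quota, and copies only if everything fits. The function names, the "recovered" destination folder, and the reading of 1 GB as 1024^3 bytes are assumptions made for illustration; how the cluster actually accounts for quota may differ.

```python
import shutil
from pathlib import Path

# Default quota from the notice; treating 1 GB as 1024**3 bytes is an assumption.
QUOTA_BYTES = 1 * 1024**3


def tree_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def restore(recovered: Path, home: Path, quota: int = QUOTA_BYTES) -> bool:
    """Copy recovered into home/"recovered" if the result fits in quota.

    Returns True if the copy was made, False if it would exceed the quota.
    The "recovered" subfolder name is a hypothetical choice.
    """
    needed = tree_size(recovered)
    used = tree_size(home)
    if used + needed > quota:
        return False
    shutil.copytree(recovered, home / "recovered", dirs_exist_ok=True)
    return True
```

A call such as restore(Path("/share/recovery/users") / "your_username", Path.home()) would attempt the copy, with your_username standing in for the real account name. If it returns False, copy individual subdirectories instead or request additional space from the IT Help Desk. The sketch does not verify file contents, so recovered files still need to be checked for correctness.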
Outage details
The NIC Cluster suffered a significant disk hardware failure that caused the loss of several portions of data. A first failure happened over the weekend, and service was restored. A second, considerably more serious failure followed, and it forced IT to move data completely off the failed RAID disk array and onto other temporary storage hardware. IT also created new "blank slate" home directories on temporary storage hardware to restore service as soon as possible.
What IT is doing to fix the problem
The NIC Cluster has been moved to a temporary, more stable environment until new upgraded hardware can be delivered and installed. The new hardware has been ordered but has not yet been delivered to campus. IT will need to take another NIC Cluster outage sometime between the Spring and Summer Semesters to migrate data to the new hardware, and will announce that outage once all of the details have been finalized.
We apologize for any inconvenience this may have caused and want to assure our customers that we're working to improve our high performance computing environment with upgraded hardware and a data recovery system.
Going forward
Not all applications on the NIC Cluster have been restored. If you need an application that doesn't appear to have been restored, please contact the IT Help Desk at 573-341-HELP or put in a ticket at http://help.mst.edu.
Although IT is currently taking steps to provide a more robust high performance computing environment, including a "disaster recovery" backup system, it's important for NIC Cluster users to back up their own data. Our current policy regarding data stored on the NIC Cluster is still in place and available for viewing at https://wiki.mst.edu/nic/access/data_policy. Although hardware failures like this one occur very rarely, backing up your own data to either your Minerfiles home directory or another location is the best way to protect your work in the event of an outage.
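One simple way to keep a personal backup is to write a timestamped archive of a working directory to a second location. The Python sketch below is an illustration under stated assumptions rather than a supported procedure: the function name and the archive naming scheme are hypothetical, and the destination must be a path you can write to, such as wherever your Minerfiles home directory is reachable from your machine.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def backup(source: Path, dest_dir: Path) -> Path:
    """Write a timestamped .tar.gz of source into dest_dir and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, not the absolute path.
        tar.add(str(source), arcname=source.name)
    return archive
```

Choose a destination outside the directory being archived so the archive does not try to include itself, and keep it on different storage from the cluster so a single hardware failure cannot take out both copies.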
Still have questions?
If you have any questions or concerns about any of these issues, please contact the IT Help Desk at 573-341-HELP.