Information about upcoming maintenance, current downtime, and RFO for past downtime will be posted here. Updates more than six months old may be purged.
Please contact email@example.com for technical support inquiries.
There is no planned maintenance or ongoing outage at this time. Please open a ticket if your services are down.
One hypervisor in Toronto (cbec3404b692) was down for approximately 10 minutes due to disk issue. The hypervisor is restored now.
Update 2 (28 November 2023 19:40 EST): a prolonged outage occurred today due to incorrect identification of affected hypervisor while diagnosing the issue. Once we identified the correct hypervisor, the hypervisor was brought back online by replacing a faulty disk. Network connectivity to VMs that were using this hypervisor as network node was affected as well.
Update 1 (28 November 2023 12:40 EST): services have been restored at this time.
One hypervisor in Toronto (7907d4eeaa3d) is down due to disk issue. We are working on restoring the hypervisor at this time.
Update 5 (6 September 2023 15:00 EDT): we have heard from the datacenter that the power issue was due to breaker tripping incorrectly. Cogent says they have swapped the breaker with a new breaker and we should not experience any further issues with that unit.
Update 4 (3 September 2023 18:00 EDT): all services should be online at this time. Please open a ticket if your service remains offline.
Update 3 (3 September 2023 17:45 EDT): most services are restored at this time. Remaining services should be online shortly.
Update 2 (3 September 2023 17:00 EDT): one of our racks in Toronto remains offline due to power issue. Block storage is impacted, as well as virtual machines running on servers in that rack. Our website is offline due to the power issue as well. We now expect resolution by 18:00 EDT.
Update 1 (3 September 2023 13:00 EDT): we expect resolution in 5 hours.
We are investigating downtime in one rack in Toronto which is affecting many services.
Update 3 (29 August 2023 21:35 EDT): services should be restored now.
Update 2 (29 August 2023 21:05 EDT): packet loss has come up again. It is due to DDoS attack targeting multiple IPs making it difficult to null route. We are working again to restore connectivity.
Update 1 (29 August 2023 20:35 EDT): services are restored at this time, more details later.
We are investigating networking issues with Toronto region (95% packet loss).
One Toronto hypervisor (a4120f5bedfa) was offline for 20-30 minutes due to hardware failure (disk plane). Services were restored after swapping the disks to a spare physical server.
Update 3 (19 February 2023 09:47 EST): VMs are starting up.
Update 2 (19 February 2023 09:39 EST): hypervisor is booting back up.
Update 1 (19 February 2023 09:06 EST): We will swap out both power supplies so that the firmware versions match, but this will require the hypervisor to be powered off.
One Toronto hypervisor (1ee830f71342) has one failed power supply. Due to a firmware mismatch, we were unable to swap the failed power supply with one in stock.
Update 2 (18 February 2023 04:59 EST): RAID array repaired and service restored.
Update 1 (18 February 2023 04:02 EST): We are investigating a potential issue with the RAID array.
Potential RAID array issue detected on hypervisor e3d6894c0619.
Update 1 (15 February 2023 01:00 EST): services have been restored.
We are performing maintenance on Toronto hypervisor 033ab13d2de5 to replace failed hardware component.
As previously announced, we have begun closing our Montreal and Roubaix locations. Remaining VMs in Montreal and Roubaix will be migrated to Toronto.
Update 1 (02 February 2023 01:45 EST): services have been restored.
Hypervisor a9a516272cf8 is offline; we are investigating.
Due to substantial cost increases imposed by the datacenter we use in Montreal and Roubaix, we are sunsetting those locations on 31 January 2023.
You can migrate your virtual machines and attached volumes to our flagship Toronto location by following this guide.
Alternatively, open a support ticket and we will be happy to migrate your services over for you.
Make sure to migrate your VMs before 31 January 2023. You will receive three months of free credit for VMs migrated before 31 January 2023.
We understand that Toronto will not be suitable for some customers. If you would like a refund of remaining credit on your account, please open a support ticket by 31 January 2023 after deleting any active services.
Update 2 (26 January 2023 23:00 EST): we have not observed further issues following the configuration change.
Update 1 (18 January 2023 13:00 EST): we have updated router configuration to disable features that we believe are related to the network blips. We will monitor to see whether this resolves the issue.
The network issues due to router malfunction, similar to the "19 October 2022" issue, have recurred. The latest network downtime was for two minutes at 11:00 EST. We upgraded router firmware last night because the change list indicated it might solve the problem, but it did not. We are continuing to look for alternative solutions, since previously our primary and backup routers both had the same problem.
Update 1 (26 January 2023 04:30 EST): We observed two brief network interruptions lasting a couple minutes each; network is now stable.
Our upstream provider in Toronto is performing a software update on the core router, resulting in intermittent network issues. We expect the update to be completed within the next 30 minutes.
Update 1 (18 January 2023 11:00 EST): this memory stick still needs to be replaced, we will do it soon.
One Toronto hypervisor (fe6b5c048e87) has one memory stick failed. We will migrate all VMs off this hypervisor so we can swap the memory stick. The migration will involve a reboot.
Update 1 (11 December 2022 22:45 EST): services are back online.
Hypervisor 1ee830f71342 crashed and we are investigating.
Update 4 (10 December 2022 11:40 EST): b4fa814b28f1 is online now. The outage started during routine maintenance and checkup of a power bar in one rack. However, power utilization surged, seemingly due to a faulty network switch, and tripped two breakers. After power was restored, several hypervisors had issues booting; most of them booted after we removed disks not associated with the root filesystem. b4fa814b28f1 and 1ee830f71342 had a different boot issue due to an outdated partition layout. It is unclear why they booted fine in the past but not this time; however, after we updated the partition layout to a modern format, they were restored as well.
Update 3 (10 December 2022 11:20 EST): 1ee830f71342 is online now and b4fa814b28f1 along with lndyn.com will be restored shortly.
Update 2 (10 December 2022 06:45 EST): most services are online, with the exception of two SSD hypervisors b4fa814b28f1 and 1ee830f71342, which should be restored within four hours. If your services are offline and not on these two hypervisors, please open a ticket.
Update 1 (10 December 2022 04:40 EST): most services are restored and we are working on restoring remaining services.
Our team is working to repair outage in Toronto related to power equipment.
Update 1 (07 December 2022 20:12 EST): looks like they fixed it.
One SSD hypervisor in Montreal is offline due to a datacenter internal network issue; we are waiting for a response.
Update 1 (09 November 2022 06:23 EDT): controller node is back online, controller panel functions restored
Control actions like restarting VM or creating new VM will be unavailable during this downtime which should be resolved in 8 to 12 hours.
Update 2 (04 November 2022 06:30 EDT): raid array repair complete and services have been restored
Update 1 (04 November 2022 01:10 EDT): we are recovering the disk and we expect it to take eight hours.
One hypervisor in Toronto has failed due to hardware issue. We are investigating.
Update 5 (04 November 2022 01:00 EDT): we have not seen more outages, so it appears that the sfp module replacement on 21 October 2022 has resolved the issue despite the one final outage after the replacement.
Update 4 (22 October 2022 17:20 EDT): there was another outage now so it is not fixed. We are considering next step.
Update 3 (22 October 2022 15:00 EDT): we did replace the sfp module last night and we are monitoring to see if there are more issues.
Update 2 (21 October 2022 22:00 EDT): there was another brief outage due to router issue this morning. Tonight at around 21:00 EDT we will swap sfp modules to see if the issue is with an sfp module.
Update 1 (19 October 2022 03:05 EDT): services are back online now. Both the current router and the previous primary router appear to have issues; we continue to monitor.
The replacement router appears to have failed. We are investigating the network outage in Toronto.
Update 2 (17 October 2022 11:58 EDT): this maintenance is done.
Update 1 (17 October 2022 11:49 EDT): this maintenance is happening now; there will be two minutes of downtime.
We are investigating intermittent external network connectivity drops in Toronto. Since the issue appears to be isolated to the primary router, we will be replacing it with our backup router on either 17 October 2022 or 18 October 2022.
Update 1 (1 July 2022 09:00 EST): Our upstream provider, Cogent Communications, will be performing network maintenance in the Toronto datacenter starting midnight of 29 July 2022. The maintenance will cause network downtime in our Toronto region for up to 60 minutes between 30 July 2022 00:00 EST and 05:00 EST.
Update 5 (25 July 2022 14:50 EST): all services back online again.
Update 4 (25 July 2022 13:30 EST): network is offline again, we are waiting for update from OVH.
Update 3 (25 July 2022 08:24 EST): hypervisor e3ba7defacb5 is now in service and we will be powering on residing virtual machines momentarily.
Update 2 (25 July 2022 06:30 EST): most networking is restored but two HDD hypervisors (9958ca9b121d and e3ba7defacb5) have disk issues which we are investigating now.
Update 1 (25 July 2022 06:00 EST): This appears to be due to a datacenter outage of rack 46-D06 (http://vms.status-ovhcloud.com/index_rbx6.html). We are waiting for reply from the datacenter (OVH).
We are investigating network issues in Roubaix.
Update 3 (16 June 2022 17:35 EST): We continue to see temperature returning to normal at the datacenter. At this time all services in the Toronto region should be online. If you continue to experience difficulties, please open a ticket or email firstname.lastname@example.org.
We apologize for the inconvenience this incident has caused.
Update 2 (16 June 2022 16:54 EST): HVAC units are being restored and temperature alerts on the servers are clearing.
Update 1 (16 June 2022 15:20 EST): the issues in Toronto today are caused by failure in the datacenter HVAC system due to a recent storm, meaning that a wide range of equipment is overheating. We will post updates as we receive more information from the datacenter (Cogent Toronto).
We are investigating network issues in Toronto.
We are aware of an issue with some virtual machines not receiving network connectivity on boot, and are working to remedy the issue.
Update 35 (23 April 2022 04:44 EST): 7887b099ad9f complete. This concludes the scheduled maintenance. If your virtual machine does not have network, please log into dynamic.lunanode.com and power cycle the virtual machine; if the problem persists, please contact us at email@example.com.
Update 34 (23 April 2022 04:34 EST): 17295cd83422 complete, we will begin the upgrade on hypervisor 7887b099ad9f now
Update 33 (23 April 2022 04:34 EST): 3e4c356e1dfa complete, we will begin the upgrade on hypervisor 17295cd83422 now
Update 32 (23 April 2022 04:28 EST): b3700f1d6136 complete, we will begin the upgrade on hypervisor 3e4c356e1dfa now
Update 31 (23 April 2022 04:23 EST): 3204943a46b6 complete, we will begin the upgrade on hypervisor b3700f1d6136 now
Update 30 (23 April 2022 04:13 EST): bce6e7722174 complete, we will begin the upgrade on hypervisor 3204943a46b6 now
Update 29 (23 April 2022 04:13 EST): a4120f5bedfa complete, we will begin the upgrade on hypervisor bce6e7722174 now
Update 28 (23 April 2022 04:07 EST): 482950451844 complete, we will begin the upgrade on hypervisor a4120f5bedfa now
Update 27 (23 April 2022 03:58 EST): fe6b5c048e87 complete, we will begin the upgrade on hypervisor 482950451844 now
Update 26 (23 April 2022 03:47 EST): 364793ac6b38 complete, we will begin the upgrade on hypervisor fe6b5c048e87 now
Update 25 (23 April 2022 03:39 EST): 7fb2d0fc768d complete, we will begin the upgrade on hypervisor 364793ac6b38 now
Update 24 (23 April 2022 03:32 EST): 049a4bdf1cf1 complete, we will begin the upgrade on hypervisor 7fb2d0fc768d now
Update 23 (23 April 2022 03:26 EST): 37cc55afc851 complete, we will begin the upgrade on hypervisor 049a4bdf1cf1 now
Update 22 (23 April 2022 03:20 EST): c81245aedd39 complete, we will begin the upgrade on hypervisor 37cc55afc851 now
Update 21 (23 April 2022 03:16 EST): 9eecfd763135 complete, we will begin the upgrade on hypervisor c81245aedd39 now
Update 20 (23 April 2022 03:11 EST): a9a516272cf8 complete, we will begin the upgrade on hypervisor 9eecfd763135 now
Update 19 (23 April 2022 03:04 EST): e3d6894c0619 complete, we will begin the upgrade on hypervisor a9a516272cf8 now
Update 18 (23 April 2022 03:00 EST): b1fda546d1e3 complete, we will begin the upgrade on hypervisor e3d6894c0619 now
Update 17 (23 April 2022 02:53 EST): cbec3404b692 complete, we will begin the upgrade on hypervisor b1fda546d1e3 now
Update 16 (23 April 2022 02:43 EST): 3aa210dd4314 complete, we will begin the upgrade on hypervisor cbec3404b692 now
Update 15 (23 April 2022 02:36 EST): 55b885149ef4 complete, we will begin the upgrade on hypervisor 3aa210dd4314 now
Update 14 (23 April 2022 02:31 EST): 03461a49b783 complete, we will begin the upgrade on hypervisor 55b885149ef4 now
Update 13 (23 April 2022 02:25 EST): fe77de27d9fb complete, we will begin the upgrade on hypervisor 03461a49b783 now
Update 12 (23 April 2022 02:17 EST): ceac64db9351 complete, we will begin the upgrade on hypervisor fe77de27d9fb now
Update 11 (23 April 2022 02:13 EST): b795bb864bbd complete, we will begin the upgrade on hypervisor ceac64db9351 now
Update 10 (23 April 2022 02:08 EST): 92dd02984a07 complete, we will begin the upgrade on hypervisor b795bb864bbd now
Update 9 (23 April 2022 02:03 EST): b86cab87888c complete, we will begin the upgrade on hypervisor 92dd02984a07 now
Update 8 (23 April 2022 01:56 EST): ae7093dfe3e1 complete, we will begin the upgrade on hypervisor b86cab87888c now
Update 7 (23 April 2022 01:50 EST): 7fd1830faf11 complete, we will begin the upgrade on hypervisor ae7093dfe3e1 now
Update 6 (23 April 2022 01:44 EST): 1ee830f71342 complete, we will begin the upgrade on hypervisor 7fd1830faf11 now
Update 5 (23 April 2022 01:19 EST): b4fa814b28f1 complete, we will begin the upgrade on hypervisor 1ee830f71342 now
Update 4 (23 April 2022 01:10 EST): 6e9041a6623f complete, we will begin the upgrade on hypervisor b4fa814b28f1 now
Update 3 (23 April 2022 00:59 EST): 033ab13d2de5 complete, we will begin the upgrade on hypervisor 6e9041a6623f now
Update 2 (23 April 2022 00:42 EST): b9a3298d39d3 complete, we will begin the upgrade on hypervisor 033ab13d2de5 now
Update 1 (23 April 2022 00:08 EST): We will begin the upgrade on hypervisor b9a3298d39d3 now
Announcement (16 April 2022 12:00 EST): We will be performing a networking upgrade on our hypervisors in the Toronto region on 23 April 2022 00:00 EST. Unfortunately virtual machines will need to be rebooted for the new configuration to take effect.
We will perform the upgrade one hypervisor at a time and expect each VM to be down for no more than 10 minutes.
If your virtual machine's network remains unresponsive after the completion of the task, please log into dynamic.lunanode.com and power-cycle the VM; if the issue persists, please contact us at firstname.lastname@example.org.
Update 1 (17 April 2022 04:45 EST): The maintenance is complete.
We are performing system maintenance in Toronto. System actions like starting and stopping VMs will be disrupted during the maintenance. Virtual machine power and network will not be impacted.
The networking upgrade has been completed. Virtual machines in the region had to be rebooted for the new configuration to take effect. We apologize for the inconvenience.
If your virtual machine's network remains unresponsive, please log into dynamic.lunanode.com and power-cycle the VM; if the issue persists, please contact us at email@example.com.
Update: services are back online.
There is downtime on some services in Montreal due to network maintenance in the datacenter. See OVH webpage for details.
The same Montreal SSD hypervisor (55d26960353e) was offline again due to an unrelated issue (OVH vRack outage), until approximately 19:10 EST.
A Montreal SSD hypervisor (55d26960353e) went down again. It is online now but we are investigating further.
Update 1 (29 January 2022 02:50 EST): services are back online. We believe the recurring kernel panic is resolved.
The same Montreal SSD hypervisor again has issue. We have identified the fault and there should be no more recurrence after the current downtime.
Update 1 (19 January 2022 23:48 EST): services are back online.
We again have kernel panic on the same Montreal SSD hypervisor. We are rebooting the machine and we will investigate further.
Update 5 (16 January 2022 07:21 EST): maintenance is now finished. If your VM is still offline please open a ticket.
Update 4 (16 January 2022 06:04 EST): we are nearing the conclusion of our maintenance, which took longer than expected; we will be rebooting virtual machines with volumes attached to restore them to functional status.
Update 3 (16 January 2022 04:22 EST): the system upgrades are still ongoing and during this time operations such as creating new VM or restarting VM will not be functional.
Update 2 (15 January 2022 22:35 EST): the system upgrades are underway and during this time operations such as creating new VM or restarting VM will not be functional.
Update 1 (8 January 2022 11:00 EST): the maintenance has been re-scheduled for 15 January 2022 22:00 EST.
We will be performing system upgrades in Toronto starting at 22:00 EST on 15 January 2022. Virtual machines may experience brief network connectivity blips during the upgrade process.
Update 1 (11 January 2022 11:26 EST): services are back online.
We are investigating downtime on one Montreal SSD hypervisor.
We have resolved downtime on one Toronto SSD hypervisor (b3700f1d6136). Services are back online at this time (06:03 EST).
Update 1 (12 November 2021 21:45 EST): services are back online.
We are investigating downtime on one Montreal SSD hypervisor. Currently the hypervisor is rebooting following kernel panic/freeze.
Update 10 (22 October 2021 15:40 EDT): the second disk that was giving read errors has now also been replaced, and the RAID1 building has finished. At this time both disks have been replaced and the RAID1 array is in good status.
Update 9 (21 October 2021 07:50 EDT): services should be restored at this time. Please open ticket if your VM is still offline.
Update 8 (21 October 2021 05:20 EDT): copy from backup to new disk is 42% complete. Once copying is finished we will begin restoring services.
Update 7 (21 October 2021 02:36 EDT): backup has completed. Since there were read errors, we need to restore from the backup rather than build the RAID1 from the second disk. This may take a couple more hours.
Update 6 (20 October 2021 23:38 EDT): backup is 85% complete.
Update 5 (20 October 2021 20:45 EDT): backup is 67% complete.
Update 4 (20 October 2021 17:05 EDT): backup is now 45% complete, and we do see a few read errors now. Progress has also slowed, and the current ETA is backup completion at 23:30 EDT (six hours from now). Furthermore, due to the read errors, another copy operation from the backup to the new SSD may be needed before we can bring services back online. Depending on your requirements, it may make sense to provision a new VM from your snapshots or other backups to avoid this extended downtime.
Update 3 (20 October 2021 14:40 EDT): backup is 25% complete, still no errors so far.
Update 2 (20 October 2021 13:22 EDT): backup is 10% complete with no errors so far. We expect it will take 6-7 hours total. We have replaced the first (failed) disk already, so once the backup of the second (working) disk is done, we will immediately try to bring services back online.
Update 1 (20 October 2021 12:55 EDT): one disk in RAID1 has failed and we see disk errors on the other disk. We are performing a full backup of this disk so there is minimal data loss. ETA is now 5-7 hours for the backup phase of the maintenance.
We are performing emergency maintenance on one SSD hypervisor in Montreal (55d26960353e) to investigate potential disk or filesystem issue. During the maintenance we will perform backup of all virtual machines. Expected downtime is one hour.
Update 2 (13 October 2021 04:21 EDT): services are back online.
Update 1 (13 October 2021 04:09 EDT): services remain offline. There is an update from the OVH CEO (from Twitter): "Suite a une erreur humaine durant la reconfiguration du network sur notre DC a VH (US-EST), nous avons un souci sur la toute la backbone. Nous allons isoler le DC VH puis fixer la conf". (In English: "Following a human error during the reconfiguration of the network at our VH datacenter (US-EST), we have a problem across the entire backbone. We will isolate the VH datacenter and then fix the configuration.")
Our upstream provider in Roubaix and Montreal, OVH, is offline. As a result, our services in the Roubaix and Montreal regions are impacted. We've received no updates and are unable to provide an ETA for resolution at this time. We will post regular updates until service is restored.
Update 1 (01 September 2021 16:27 EDT): services have been restored.
We are investigating network downtime in Roubaix. Services appear to be offline due to issue with vrack infrastructure from the datacenter that we use in Roubaix (OVH).
Update 4 (16 July 2021 14:48 EDT): services remain online but we continue to monitor the situation. We still do not have details regarding the vrack infrastructure outage from the datacenter (OVH), which caused the network outage. An additional issue with volume storage was identified that arose because of our interventions while investigating and attempting to resolve the network outage; specifically, some storage machines were rebooted but storage services were not started immediately. We apologize for the extended downtime. Unfortunately, we cannot guarantee that the issue will not recur, since it is caused by issues with vrack infrastructure provided by a third party (OVH); our only options would be to move VMs to a new datacenter or to close the region.
Update 3 (16 July 2021 13:15 EDT): services are back to normal but we have still not gotten a detailed reason for the network outage from the datacenter (OVH).
Update 2 (16 July 2021 13:00 EDT): services are still deteriorated and we are working with datacenter to resolve the issue with OVH vrack infrastructure.
Update 1 (16 July 2021 12:30 EDT): Services have been restored. It is unclear at this time whether outage is caused by datacenter infrastructure issue or our network node server having kernel lockup. We will continue to investigate.
We are investigating network downtime in Roubaix.
Update 1 (16 June 18:12 EDT): Services have been restored.
Our upstream provider OVH is experiencing an outage on their vrack infrastructure; the majority of our services in the Roubaix region are offline as a result. No ETA.
Update 1 (25 May 22:00 EDT): all services were successfully migrated off the hypervisor. However after further investigation, we do not find any disk issues on the hypervisor, so we have brought it back into service.
Due to a detected potential disk issue on Roubaix hypervisor e3ba7defacb5, we will perform emergency maintenance involving migrating all virtual machines off the hypervisor to other hypervisors. This will involve a reboot of each VM with 2-10 minutes downtime depending on the disk size of the VM. The maintenance will begin immediately.
Update 1 (03 May 2021 21:49 EDT): we have switched over from primary router to backup router, since it seems to be router crashing problem. We will investigate further if the backup router has the same problem.
We are aware of intermittent one-minute network drops in Toronto following the network upgrades on 30 April 2021. We are still investigating these drops. We replaced a fiber cable that seemed potentially broken on 01 May 2021 17:37 EDT (after observing a drop at 01 May 2021 16:51 EDT), but today we see another drop at 21:03 EDT.
Update 1 (30 April 2021): the maintenance was performed at 23:45 instead due to some issues. But it is done now with up to 2 minute network interruption to most services.
Notice (24 April 2021): Services in Toronto will experience a 1 to 5 minute network interruption on 2021-04-30 during the maintenance window from 22:00 EDT to 23:00 EDT due to planned network upgrades.
Update 1 (30 March 2021 21:20 EDT): services are back online.
We are investigating downtime on one Toronto hypervisor, 049a4bdf1cf1.
Update 2 (05 March 2021 14:36 EST): RFO: our monitoring system detected that one disk in a RAID10 group failed this morning. We use service from OVH in Montreal and Roubaix and requested disk replacement. It seems they needed to turn the server off to replace the disk instead of hot-swapping it. The RAID10 group is now restored to normal. Note that in our main location, Toronto, we can always hot-swap disks.
Update 1 (05 March 2021 14:35 EST): services are back online.
One Montreal SSD hypervisor (de6e4fa83fdc) is offline for emergency maintenance due to disk issue. We expect 10-20 minutes downtime.
Update 1 (15 February 2021 02:50 EST): services are back online.
We are investigating downtime on one Toronto SSD hypervisor: fe77de27d9fb.
Update 2 (14 January 2021 17:33 EST): the downtime was due to kernel panic. We will check further to determine if the current kernel is sufficient to avoid future recurrence of this downtime incident.
Update 1 (14 January 2021 17:27 EST): services are back online.
We are investigating downtime on one Montreal SSD hypervisor, 55d26960353e.
Update 1 (04 January 2021 22:10 EST): services are back online.
We are performing emergency maintenance on hypervisor 7fd1830faf11 in Toronto due to memory bugs associated with the old kernel. Ten minutes of downtime is needed to update to a new kernel to resolve these issues.
Services were offline from 21:10 EDT to 21:20 EDT due to upstream issue. See http://travaux.ovh.net/?do=details&id=46759 for details.
Update 2 (29 July 2020 14:55 EDT): services are back online.
Update 1 (29 July 2020 14:20 EDT): the OVH issue link is at http://travaux.ovh.net/?do=details&id=45822.
Again network is down due to upstream (OVH) vrack network issue.
Update 1 (29 July 04:56 EDT): services are back online.
Network is down due to upstream (OVH) vrack issue. http://travaux.ovh.net/?do=details&id=45816
Update 1 (10 July 2020 18:30 EDT): services are back online.
Hypervisor 1ee830f71342 is offline due to power cable tension issue and technician error during installation/migration of equipment for/to new 20A circuit. Services should be back online in five to ten minutes.
RFO Update (06 July 2020 21:00 EDT): Outage summary: at 06 July 2020 10:05 EDT there was a power surge at Cogent's Toronto 245 Consumers Rd datacenter, which led to server restarts and one PDU failure in one of our racks. (Other Cogent customers were also impacted, and throughout the day the datacenter was quite crowded.) Due to the PDU failure, servers in that rack did not come back online. Our technician decided to replace the PDU at 11:00, and most servers began booting normally, but four servers had issues booting. One came online at 11:50 EDT, at which point services that did not depend on the volume storage system, with the exception of virtual machines on the other servers, were able to come online. The exact issue on the other three servers is not clear, but after re-seating the RAID controllers, taking out some disks, and making some other changes, they were able to boot. (We only had two spare servers, so we would not have been able to quickly bring them online if none of the three were able to boot.) At 14:55 the first of those three servers came online and volume storage system operation was restored, and at 15:40 the last of those servers came online. There were a few other issues during the afternoon due to DHCP responses not being sent to booting VMs and IPv6 connectivity not working because the radvd daemon was not running; we have added monitoring for the latter issue. We plan to investigate improved Ceph server grouping to reduce the probability of storage system downtime, and potentially more extensive virtual network monitoring to get details on each user virtual network.
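For readers who want the elapsed times implied by the outage summary above, the following is a minimal sketch that computes them from the timestamps stated in the RFO. The event labels and the helper name `minutes_since_start` are illustrative, not part of any tooling we use, and the sketch assumes all times are EDT on 06 July 2020.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the outage summary; assumed to all be EDT on 06 July 2020.
# The labels are illustrative shorthand for the events described in the RFO.
events = {
    "power surge": "2020-07-06 10:05",
    "PDU replacement": "2020-07-06 11:00",
    "first of four servers online": "2020-07-06 11:50",
    "volume storage restored": "2020-07-06 14:55",
    "last server online": "2020-07-06 15:40",
}


def minutes_since_start(events, start_key="power surge"):
    """Return whole minutes elapsed between the start event and each event."""
    start = datetime.strptime(events[start_key], FMT)
    return {
        name: int((datetime.strptime(ts, FMT) - start).total_seconds() // 60)
        for name, ts in events.items()
    }


elapsed = minutes_since_start(events)
for name, mins in elapsed.items():
    # Print each milestone as an offset from the power surge.
    print(f"{name}: +{mins // 60}h{mins % 60:02d}m")
```

With these inputs, volume storage was restored 4 hours 50 minutes after the power surge, and the last server came online 5 hours 35 minutes after it.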
Update 13 (06 July 2020 15:40 EDT): ceac64db9351 is back online now. All services should be restored at this time. Please open a ticket if your services are still down.
Update 12 (06 July 2020 15:05 EDT): cbec3404b692 is also back online now. At this time only ceac64db9351 is offline. Volume storage should be operational. Please open ticket if your VM is not on ceac64db9351 but you have services offline even after pressing shutdown/startup and/or reboot.
Update 11 (06 July 2020 14:55 EDT): volume storage system and hypervisor b1fda546d1e3 are back online. For volume VMs: if your VM is not working correctly please shutoff and then restart your VM. VMs on ceac64db9351, cbec3404b692 remain offline.
Update 10 (06 July 2020 14:20 EDT): all three hypervisors and volume storage system remain offline. We continue to work to resolve the issues but do not currently have ETA on resolution.
Update 9 (06 July 2020 14:00 EDT): three hypervisors remain offline due to hardware failure after power surge: ceac64db9351, cbec3404b692, and b1fda546d1e3. Volume storage system remains impacted and may show timeout operations on some volumes. If your service is not on one of those hypervisors and does not use volumes, but remains offline, please open support ticket.
Update 8 (06 July 2020 13:50 EDT): if you see your VM is online in VNC but not reachable, please try restart, and if it does not work please open ticket.
Update 7 (06 July 2020 13:16 EDT): we have resolved the issue on one of the three failed storage nodes and it is coming online now. We are working on the other two nodes but the first one should be sufficient to bring entire storage system online.
Update 6 (06 July 2020 12:48 EDT): most services are back online but three storage nodes are offline which means distributed storage system is offline so VMs with volume cannot be booted until we resolve this issue.
Update 5 (06 July 2020 11:50 EDT): controller node is booted.
Update 4 (06 July 2020 11:40 EDT): no replacement needed after removing serial port connection. ETA 10 minutes until most services except three hypervisors are online.
Update 3 (06 July 2020 11:35 EDT): some servers are damaged due to power surge. We need to replace them with backup servers.
Update 2 (06 July 2020 11:05 EDT): PDU failed due to datacenter power surge. It has been replaced and services should be coming back online soon.
Update 1 (06 July 2020 10:20 EDT): appears to be power issue. Our technicians will arrive on-site shortly.
Many services in Toronto are down. We are investigating.
Update 1 (26 June 2020 21:45 EDT): services are back online after replacing hardware component.
One hypervisor in Toronto (b9a3298d39d3) is offline. We are investigating.
Update 1 (25 June 11:00 EDT): the reboot last night did not solve the latency problem, but we applied additional tuning steps today and confirmed that the issue is resolved and latency is low and stable.
We need to perform maintenance on one Montreal SSD hypervisor (55d26960353e) to fix kernel issue. The maintenance will involve reboot and we estimate five minute downtime. The maintenance will occur within a window between 11:30 pm and midnight on 24 June 2020. Please contact us via support ticket if you need your VM migrated off the hypervisor before the maintenance window. We apologize for the inconvenience caused by this maintenance event.
Update 3 (24 June 2020 16:35 EDT): the network appears to be stable now and we are closing this issue.
Update 2 (24 June 2020 16:30 EDT): the network is back online now but it may still be very unstable. We have received absolutely no update from the datacenter.
Update 1 (24 June 2020 15:40 EDT): the datacenter has finally opened an issue about this incident: http://travaux.ovh.net/?do=details&id=45319&. At this time, network connectivity is fully offline.
We have observed heavy packet loss in Roubaix since 14:40 EDT due to a datacenter issue. We are contacting the datacenter to investigate this problem.
Update 1 (07 June 2020 12:52 EDT): services are back online.
Roubaix services are offline due to a datacenter (OVH) issue.
Update 3 (08 May 2020 05:40 EDT): services are back online after five minutes.
Update 2 (08 May 2020 05:35 EDT): the network is offline for the maintenance.
Update 1 (01 May 2020 22:00 EDT): our datacenter provider in Montreal (OVH BHS) plans network maintenance affecting one Montreal SSD hypervisor (de6e4fa83fdc) on 08 May 2020 05:30 EDT. The maintenance affects network equipment only, meaning VMs will remain online but will be inaccessible over the network during the maintenance, which is expected to last up to 15 minutes. Please check http://travaux.ovh.net/?do=details&id=44377 for more details and updates.
Update 3 (29 April 2020 23:21 EDT): after the reboot, this hypervisor looks stable. All is good.
Update 2 (29 April 2020 23:10 EDT): services are back online after eight minutes. We will verify shortly that the errors are gone.
Update 1 (29 April 2020 18:00 EDT): customers have been notified of emergency planned maintenance tonight involving a reboot of one Montreal SSD hypervisor (55d26960353e). The reboot is necessary due to errors indicating a kernel issue.
Update 1 (26 April 2020 17:45 EDT): services are back online.
One Toronto SSD hypervisor is offline. We are investigating.
Update 1 (24 April 2020 15:55 EDT): the datacenter (OVH) has fixed the issue and services are back online.
One Montreal SSD hypervisor has network downtime due to a datacenter (OVH) vrack issue.
Update 3 (20 April 2020 16:00 EDT): we have not seen any further issues. We believe the reboot and updated kernel solved the problem.
Update 2 (13 April 2020 17:15 EDT): services are back online.
Update 1 (13 April 2020 16:58 EDT): the vrack did not come up after the reboot. We are trying a hard reboot.
One Roubaix SSD hypervisor is offline for ten minutes from 16:45 to 16:55 EDT due to emergency maintenance to correct a kernel issue causing high latency and dropped packets.
Update 4 (10 April 2020 15:00 EDT): we do not see any further issues.
Update 3 (09 April 2020 13:25 EDT): services are back online at this time. We continue to monitor the hypervisor, but we believe the issues should be resolved.
Update 2 (09 April 2020 13:00 EDT): we continue to see intermittent issues. We are rebooting the hypervisor now.
Update 1 (09 April 2020 08:40 EDT): we have not seen the issue (brief loss of network connectivity and lockup of VMs) recur since our last intervention, which was to restart the hypervisor software. It is not clear why the hypervisor software would have caused this problem, so we continue to monitor the hypervisor.
We see intermittent issues on hypervisor fe77de27d9fb in Toronto (you can check the hypervisor ID by selecting your VM and looking at the ID next to "Hypervisor"). We are investigating the issues. Please open a ticket if you would like your VM to be migrated to a different hypervisor.
Update 2 (09 April 2020 08:40 EDT): our team could not find the issue last night; switching to the backup router and replacing the fiber module did not help. The packet loss is now gone. It must have been a datacenter issue. Big waste of time.
Update 1 (09 April 2020 01:23 EDT): we still see 1% packet loss in Toronto. It does not appear to be a datacenter-wide issue. Our team is on-site and still investigating. We may switch to the backup router or perform other similar actions that may cause brief network disruptions (less than one minute).
We observe packet loss in Toronto. We are investigating.
Update 1 (30 March 2020 11:43 EDT): services are back online at this time. Here is OVH issue.
We see network downtime due to a datacenter issue at this time.
There may be 30-60 minutes of network downtime between 00:00 EDT and 03:00 EDT in Toronto due to planned network maintenance by Cogent.
Update 1 (18 February 2020 00:10 EST): services are back online at this time.
We are conducting emergency maintenance related to CPU cooling on one Roubaix hypervisor.
Update 2 (5 February 2020 16:25 EST): services are back online at this time. Here is OVH issue.
Update 1 (5 February 2020 16:10 EST): we expect resolution will require a reboot of this server and may take 30 minutes.
Internal and external networking for VMs on one Montreal hypervisor is offline due to a datacenter issue. We are in communication with OVH to resolve the problem.
Update 1 (14 January 2020 00:20 EST): services are back online at this time.
We are investigating downtime on one HDD hypervisor in Montreal.