Information about upcoming maintenance, current downtime, and reasons for outage (RFO) for past downtime will be posted here. Updates more than six months old may be purged.
Please contact firstname.lastname@example.org for technical support inquiries.
Update 4 (16 July 2021 14:48 EDT): services remain online but we continue to monitor the situation. We still do not have details from the datacenter (OVH) regarding the vrack infrastructure outage that caused the network outage. We also identified an additional issue with volume storage that arose from our own interventions while investigating and attempting to resolve the network outage; specifically, some storage machines were rebooted but storage services were not started immediately. We apologize for the extended downtime. Unfortunately, we cannot guarantee that the issue will not recur, since it is caused by problems with vrack infrastructure provided by a third party (OVH); our only options would be to move VMs to a new datacenter or to close the region.
Update 3 (16 July 2021 13:15 EDT): services are back to normal but we have still not received a detailed reason for the network outage from the datacenter (OVH).
Update 2 (16 July 2021 13:00 EDT): services are still degraded and we are working with the datacenter to resolve the issue with OVH vrack infrastructure.
Update 1 (16 July 2021 12:30 EDT): services have been restored. It is unclear at this time whether the outage was caused by a datacenter infrastructure issue or by a kernel lockup on our network node server. We will continue to investigate.
We are investigating network downtime in Roubaix.
Update 1 (16 June 18:12 EDT): Services have been restored.
Our upstream provider OVH is experiencing an outage on their vrack infrastructure. As a result, the majority of our services in the Roubaix region are offline. No ETA.
Update 1 (25 May 22:00 EDT): all services were successfully migrated off the hypervisor. However, after further investigation we did not find any disk issues on the hypervisor, so we have brought it back into service.
Due to a detected potential disk issue on Roubaix hypervisor e3ba7defacb5, we will perform emergency maintenance involving migrating all virtual machines off the hypervisor to other hypervisors. This will involve a reboot of each VM with 2-10 minutes of downtime depending on the disk size of the VM. The maintenance will begin immediately.
Update 1 (03 May 2021 21:49 EDT): we have switched over from the primary router to the backup router, since the problem appears to be the primary router crashing. We will investigate further if the backup router shows the same problem.
We are aware of intermittent one-minute network drops in Toronto following the network upgrades on 30 April 2021. We are still investigating these drops. We replaced a fiber cable that seemed potentially broken on 01 May 2021 17:37 EDT (after observing a drop at 01 May 2021 16:51 EDT), but today we saw another drop at 21:03 EDT.
Update 1 (30 April 2021): due to some issues, the maintenance was performed at 23:45 instead. It is now complete, with a network interruption of up to 2 minutes for most services.
Notice (24 April 2021): Services in Toronto will experience a 1 to 5 minute network interruption on 2021-04-30 during the maintenance window from 22:00 EDT to 23:00 EDT due to planned network upgrades.
Update 1 (30 March 2021 21:20 EDT): services are back online.
We are investigating downtime on one Toronto hypervisor, 049a4bdf1cf1.
Update 2 (05 March 2021 14:36 EST): RFO: our monitoring system detected that one disk in a RAID10 group failed this morning. We use servers from OVH in Montreal and Roubaix, so we requested a disk replacement from them. It seems they needed to turn the server off to replace the disk instead of hot-swapping it. The RAID10 group is now restored to normal. Note that in our main location, Toronto, we can always hot-swap disks.
Update 1 (05 March 2021 14:35 EST): services are back online.
One Montreal SSD hypervisor (de6e4fa83fdc) is offline for emergency maintenance due to a disk issue. We expect 10-20 minutes of downtime.
Update 1 (15 February 2021 02:50 EST): services are back online.
We are investigating downtime on one Toronto SSD hypervisor: fe77de27d9fb.
Update 2 (14 January 2021 17:33 EST): the downtime was due to a kernel panic. We will check further to determine whether the current kernel is sufficient to prevent a recurrence of this incident.
Update 1 (14 January 2021 17:27 EST): services are back online.
We are investigating downtime on one Montreal SSD hypervisor, 55d26960353e.
Update 1 (04 January 2021 22:10 EST): services are back online.
We are performing emergency maintenance on hypervisor 7fd1830faf11 in Toronto due to memory bugs associated with an old kernel. Ten minutes of downtime are needed to update to a new kernel that resolves these issues.
Services were offline from 21:10 EDT to 21:20 EDT due to an upstream issue. See http://travaux.ovh.net/?do=details&id=46759 for details.
Update 2 (29 July 2020 14:55 EDT): services are back online.
Update 1 (29 July 2020 14:20 EDT): the OVH issue link is at http://travaux.ovh.net/?do=details&id=45822.
The network is down again due to an upstream (OVH) vrack network issue.
Update 1 (29 July 04:56 EDT): services are back online.
The network is down due to an upstream (OVH) vrack issue. See http://travaux.ovh.net/?do=details&id=45816 for details.
Update 1 (10 July 2020 18:30 EDT): services are back online.
Hypervisor 1ee830f71342 is offline due to a power cable tension issue and technician error while installing equipment for, and migrating equipment to, a new 20A circuit. Services should be back online in five to ten minutes.
RFO Update (06 July 2020 21:00 EDT): Outage summary: at 06 July 2020 10:05 EDT there was a power surge at Cogent's Toronto 245 Consumers Rd datacenter, which led to server restarts and one PDU failure in one of our racks. (Other Cogent customers were also impacted, and the datacenter was quite crowded throughout the day.) Due to the PDU failure, servers in that rack did not come back online. Our technician decided to replace the PDU at 11:00, and most servers began booting normally, but four servers had issues booting. One came online at 11:50 EDT, at which point services that did not depend on the volume storage system, with the exception of virtual machines on the other servers, were able to come online. The exact issue on the other three servers is not clear, but after re-seating the RAID controllers, removing some disks, and making some other changes, they were able to boot. (We only had two spare servers, so we would not have been able to quickly bring them online if none of the three had been able to boot.) At 14:55 the first of those three servers came online and volume storage system operation was restored, and at 15:40 the last of those servers came online. There were a few other issues during the afternoon: DHCP responses were not being sent to booting VMs, and IPv6 connectivity was not working because the radvd daemon was not running; we have added monitoring for the latter issue. We plan to investigate improved Ceph server grouping to reduce the probability of storage system downtime, and potentially more extensive virtual network monitoring to get details on each user virtual network.
Update 13 (06 July 2020 15:40 EDT): ceac64db9351 is back online now. All services should be restored at this time. Please open a ticket if your services are still down.
Update 12 (06 July 2020 15:05 EDT): cbec3404b692 is also back online now. At this time only ceac64db9351 is offline. Volume storage should be operational. Please open a ticket if your VM is not on ceac64db9351 but your services are still offline even after pressing shutdown/startup and/or reboot.
Update 11 (06 July 2020 14:55 EDT): volume storage system and hypervisor b1fda546d1e3 are back online. For volume VMs: if your VM is not working correctly, please shut off and then restart your VM. VMs on ceac64db9351 and cbec3404b692 remain offline.
Update 10 (06 July 2020 14:20 EDT): all three hypervisors and the volume storage system remain offline. We continue to work to resolve the issues but do not currently have an ETA on resolution.
Update 9 (06 July 2020 14:00 EDT): three hypervisors remain offline due to hardware failure after the power surge: ceac64db9351, cbec3404b692, and b1fda546d1e3. The volume storage system remains impacted and operations on some volumes may time out. If your service is not on one of those hypervisors and does not use volumes, but remains offline, please open a support ticket.
Update 8 (06 July 2020 13:50 EDT): if your VM appears online in VNC but is not reachable, please try restarting it, and if that does not work please open a ticket.
Update 7 (06 July 2020 13:16 EDT): we have resolved the issue on one of the three failed storage nodes and it is coming online now. We are working on the other two nodes, but the first one should be sufficient to bring the entire storage system online.
Update 6 (06 July 2020 12:48 EDT): most services are back online, but three storage nodes are offline. This means the distributed storage system is offline, so VMs with volumes cannot be booted until we resolve this issue.
Update 5 (06 July 2020 11:50 EDT): the controller node has booted.
Update 4 (06 July 2020 11:40 EDT): no replacement was needed after removing a serial port connection. ETA is 10 minutes until most services, except those on three hypervisors, are online.
Update 3 (06 July 2020 11:35 EDT): some servers were damaged by the power surge. We need to replace the affected hardware with a backup server.
Update 2 (06 July 2020 11:05 EDT): a PDU failed due to a datacenter power surge. It has been replaced and services should be coming back online soon.
Update 1 (06 July 2020 10:20 EDT): this appears to be a power issue. Our technicians will arrive on-site shortly.
Many services in Toronto are down. We are investigating.
Update 1 (26 June 2020 21:45 EDT): services are back online after replacing a hardware component.
One hypervisor in Toronto (b9a3298d39d3) is offline. We are investigating.
Update 1 (25 June 11:00 EDT): the reboot last night did not solve the latency problem, but we applied additional tuning steps today and confirmed that the issue is resolved; latency is now low and stable.
We need to perform maintenance on one Montreal SSD hypervisor (55d26960353e) to fix a kernel issue. The maintenance will involve a reboot and we estimate five minutes of downtime. The maintenance will occur within a window between 11:30 pm and midnight on 24 June 2020. Please contact us via support ticket if you need your VM migrated off the hypervisor before the maintenance window. We apologize for the inconvenience caused by this maintenance event.
Update 3 (24 June 2020 16:35 EDT): the network appears to be stable now and we are closing this issue.
Update 2 (24 June 2020 16:30 EDT): the network is back online now but it may still be very unstable. We have received absolutely no update from the datacenter.
Update 1 (24 June 2020 15:40 EDT): the datacenter has finally opened an issue about this incident: http://travaux.ovh.net/?do=details&id=45319&. At this time network connectivity is fully offline.
We have observed heavy packet loss in Roubaix starting at 14:40 EDT due to a datacenter issue. We are contacting the datacenter to investigate this problem.
Update 1 (07 June 2020 12:52 EDT): services are back online.
Roubaix services are offline due to a datacenter (OVH) issue.
Update 3 (08 May 2020 05:40 EDT): services are back online after five minutes.
Update 2 (08 May 2020 05:35 EDT): network is offline for the maintenance.
Update 1 (01 May 2020 22:00 EDT): our datacenter provider in Montreal (OVH BHS) plans network maintenance affecting one Montreal SSD hypervisor (de6e4fa83fdc) on 08 May 2020 05:30 EDT. The maintenance affects network equipment only, meaning VMs will remain online but will be inaccessible over the network during the maintenance, which is expected to last up to 15 minutes. Please check http://travaux.ovh.net/?do=details&id=44377 for more details/updates.
Update 3 (29 April 2020 23:21 EDT): the hypervisor looks stable after the reboot; all is good.
Update 2 (29 April 2020 23:10 EDT): services are back online after eight minutes. We will verify shortly that the errors are gone.
Update 1 (29 April 2020 18:00 EDT): customers have been notified of emergency maintenance planned for tonight, involving a reboot of one Montreal SSD hypervisor (55d26960353e). It is necessary due to errors indicating a kernel issue.
Update 1 (26 April 2020 17:45 EDT): services are back online.
One Toronto SSD hypervisor is offline. We are investigating.
Update 1 (24 April 2020 15:55 EDT): the datacenter has fixed the issue and services are back online.
One Montreal SSD hypervisor is experiencing network downtime due to a datacenter (OVH) vrack issue.
Update 3 (20 April 2020 16:00 EDT): we have not seen any further issues. We believe the reboot and updated kernel solved the problem.
Update 2 (13 April 2020 17:15 EDT): services are back online.
Update 1 (13 April 2020 16:58 EDT): vrack did not come up after the reboot. We are trying a hard reboot.
One Roubaix SSD hypervisor is offline for ten minutes, from 16:45 to 16:55 EDT, for emergency maintenance to correct a kernel issue causing high latency and dropped packets.
Update 4 (10 April 2020 15:00 EDT): we do not see any further issues.
Update 3 (09 April 2020 13:25 EDT): services are back online at this time. We continue to monitor the hypervisor, but we believe the issues should be resolved.
Update 2 (09 April 2020 13:00 EDT): we continue to see intermittent issues. We are rebooting the hypervisor now.
Update 1 (09 April 2020 08:40 EDT): we have not seen the issue (brief loss of network connectivity and lockup of VMs) recur since our last intervention, which was to restart the hypervisor software. It is not clear why the hypervisor software would have caused this problem, so we continue to monitor the hypervisor.
We see intermittent issues on hypervisor fe77de27d9fb in Toronto (you can check the hypervisor ID by selecting your VM and looking at the ID next to "Hypervisor"). We are investigating the issues. Please open a ticket if you would like your VM to be migrated to a different hypervisor.
Update 2 (09 April 2020 08:40 EDT): our team could not find the issue last night; switching to the backup router and replacing the fiber module did not help. The packet loss is now gone. It must have been a datacenter issue. Big waste of time.
Update 1 (09 April 2020 01:23 EDT): we still see 1% packet loss in Toronto. It does not appear to be a datacenter-wide issue. Our team is on-site and still investigating. We may switch to the backup router or perform other similar actions that may cause brief network disruptions (less than one minute).
We observe packet loss in Toronto. We are investigating.
Update 1 (30 March 2020 11:43 EDT): services are back online at this time. Here is the OVH issue.
We see network downtime due to a datacenter issue at this time.
There may be 30-60 minutes of network downtime between 00:00 EDT and 03:00 EDT in Toronto due to planned Cogent network maintenance.
Update 1 (18 February 2020 00:10 EST): services are back online at this time.
We are conducting emergency maintenance related to CPU cooling on one Roubaix hypervisor.
Update 2 (5 February 2020 16:25 EST): services are back online at this time. Here is the OVH issue.
Update 1 (5 February 2020 16:10 EST): we expect resolution will require a reboot of this server and may take 30 minutes.
Internal and external networking for VMs on one Montreal hypervisor is offline due to a datacenter issue. We are in communication with OVH to resolve the problem.
Update 1 (14 January 2020 00:20 EST): services are back online at this time.
We are investigating downtime on one HDD hypervisor in Montreal.