Description
We are getting 503s multiple times a day on https://metacpan.org/
What has been done:
- Add uWSGI stats monitor
- New DataDog account https://app.datadoghq.eu/
  - Send Fastly CDN logs for Web ( source: fastly service_name: Web )
  - Send Fastly CDN logs for API ( source: fastly service_name: API )
  - Send uWSGI logs
  - Add K8s DataDog agent to cluster
  - Create public dashboard for web
  - Create private dashboard to help identify IPs to block
- https://metacpan.org/robots.txt updated to list most bots and tell them not to crawl
- VCL snippet deployed in Fastly to block based on user agents (since rolled back)
- MANY IPs and IP ranges (a very large number belonging to Alibaba.com) blocked using Fastly's IP block list feature
Reached out to Fastly and signed a contract with them (they are giving us these services for free)
- Enabled DDoS protection, but the traffic we get is not triggering it
- Trying to enable their WAF but having problems; this is with their support team currently
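To help with the "identify IPs to block" step above, here is a minimal sketch of grouping requests by network so that whole ranges stand out, not just single addresses. It assumes the logs can be exported as JSON lines with a `client_ip` field; that field name, the /24 grouping, and the function name are assumptions for illustration, not the actual Fastly or DataDog schema.

```python
import ipaddress
import json
from collections import Counter


def top_networks(log_lines, prefix_len=24, limit=10):
    """Count requests per IPv4 network and return the busiest ones."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
            # "client_ip" is a hypothetical field name, adjust to the real export.
            ip = ipaddress.ip_address(record["client_ip"])
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed lines or records without a usable IP
        if ip.version != 4:
            continue  # IPv4 only, to keep the sketch simple
        # strict=False masks the host bits, e.g. 10.0.0.7/24 -> 10.0.0.0/24
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts.most_common(limit)
```

The returned list of `(network, request_count)` pairs could then be reviewed by hand before anything is added to the block list.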
Our own blocking attempts:
- Initial attempts to add Anubis to our cluster failed and we had to disable it. It is not clear how this would have worked with Fastly as our CDN caching proxy in any case.
Other actions:
- RSS fixed in the containers for uWSGI which seemed to solve memory issues...
- Our web containers were given even more memory as of 4pm on Sun 1 Jun. They had been running close to their memory limit but were not often getting OOMKilled, and when they were, it didn't match our 503 spikes.
- Review https://grafana.do.metacpan.org/ to see if any other info is visible
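The check of whether OOMKilled events line up with 503 spikes can be sketched roughly as below. This is only an illustration: it assumes both sets of timestamps have already been pulled out of the monitoring tools as `datetime` values, and the function name and the five-minute window are arbitrary choices.

```python
from datetime import datetime, timedelta


def kills_near_spikes(oom_kills, spikes, window=timedelta(minutes=5)):
    """Return the OOM kill times that fall within `window` of any 503 spike."""
    return [
        kill
        for kill in oom_kills
        if any(abs(kill - spike) <= window for spike in spikes)
    ]
```

An empty or near-empty result for a day with several 503 spikes would support the observation that memory pressure is not the main cause.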
Areas to explore
- Is our k8s nginx ingress somehow causing this issue under load?
  - Grafana ingress nginx logs show we are getting upstream errors from nginx-ingress
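To compare those upstream errors against the 503 spikes, one option is to count them per minute. The sketch below assumes the ingress error logs can be exported as plain text in nginx's usual error-log layout (`YYYY/MM/DD HH:MM:SS [level] ...`); how they are actually stored in Grafana is not stated here, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Usual nginx error-log prefix: "YYYY/MM/DD HH:MM:SS [level] message"
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[(\w+)\] (.*)$")


def upstream_errors_per_minute(lines):
    """Count error-level lines mentioning 'upstream', keyed by minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not in the assumed format
        minute, level, message = match.groups()
        if level == "error" and "upstream" in message:
            counts[minute] += 1
    return counts
```

Minutes with unusually high counts can then be lined up against the 503 spikes seen at the CDN.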