DevOps L3 Q&A

SET – 1
1. How would you design a scalable and resilient CI/CD pipeline for a multi-region
microservices architecture?

 Use distributed build agents in each region to reduce latency.
 Use global load balancers to distribute traffic across services.
 Implement multi-region artifact repositories (e.g., Nexus, Artifactory).
 Automate deployments using GitOps with multi-region clusters.
 Add canary deployments and auto-scaling features to ensure zero downtime.

2. How do you handle infrastructure drift in a cloud environment, and what tools would
you use?

 Infrastructure drift occurs when manual changes are made outside of IaC
tools, causing the live environment to diverge from the declared configuration.
 Use tools like Terraform or Pulumi to manage drift by detecting changes in
state and applying corrective actions.
 Implement policy as code with tools like Open Policy Agent (OPA) to ensure
compliance with defined infrastructure standards.
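As a sketch of the detect-and-correct loop, the function below diffs a desired state against a live inventory the way `terraform plan` classifies changes into create/update/delete; the resource dictionaries are hypothetical stand-ins, not a real provider API:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare desired (IaC) state with actual (live) state and
    return a corrective plan, mirroring what `terraform plan` reports."""
    plan = {"create": [], "update": [], "delete": []}
    for name, spec in desired.items():
        if name not in actual:
            plan["create"].append(name)      # declared but missing
        elif actual[name] != spec:
            plan["update"].append(name)      # drifted from declaration
    for name in actual:
        if name not in desired:
            plan["delete"].append(name)      # created outside of IaC
    return plan

desired = {"web-sg": {"port": 443}, "db": {"size": "db.t3.medium"}}
actual = {"web-sg": {"port": 80}, "cache": {"size": "small"}}
print(detect_drift(desired, actual))
# {'create': ['db'], 'update': ['web-sg'], 'delete': ['cache']}
```

A real tool compares a state file against live API responses, but the classification logic is the same.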

3. Can you walk through the design of a High-Availability (HA) Kubernetes cluster across
multiple regions?

 Use multi-master clusters with etcd distributed across regions.
 Set up cross-region load balancers (e.g., AWS Global Accelerator).

 Utilize Persistent Volume Claims (PVCs) and object storage (e.g., S3) for
distributed data storage.
 Implement horizontal scaling with auto-scaling policies and node affinity for
region-specific pods.

4. How do you handle Disaster Recovery (DR) in a microservices environment?

 Use multi-region deployments with data replication (e.g., RDS Read Replicas).
 Maintain backups and point-in-time restores for databases. Implement a
runbook for failover strategies.
 Use chaos engineering tools like Gremlin or Chaos Monkey to simulate
failures and test DR capabilities.

5. How would you implement security at various stages of a DevOps pipeline?
 Pre-commit: Use static code analysis and tools like SonarQube.
 Build: Scan dependencies for vulnerabilities using Snyk or OWASP
Dependency-Check.
 Pre-deploy: Container security scanning using Aqua, Twistlock, or Clair.
 Post-deploy: Monitor for security anomalies using Falco or AWS GuardDuty.

6. What strategies would you use to handle scaling in a hybrid cloud environment?
 Implement autoscaling policies for both on-prem and cloud workloads using a
mix of Kubernetes Cluster Autoscaler and cloud-native auto-scaling (AWS,
Azure, GCP).
 Use service mesh tools like Istio to manage network traffic and routing
between on-prem and cloud environments. Implement cost-based scaling to
optimize resource allocation based on cloud provider pricing models.

7. What’s your approach to ensuring zero downtime during major infrastructure changes?
 Use blue-green or canary deployments to safely roll out changes. Leverage
feature toggles to switch between new and old infrastructure.
 Use tools like Kubernetes Rolling Updates and ensure proper health checks
for services.
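The health-gated progressive rollout described above can be sketched as follows; `route_traffic` is a hypothetical stand-in for updating load-balancer or service-mesh weights:

```python
def route_traffic(canary_pct: int) -> None:
    # Hypothetical hook: in practice this would update an Istio
    # VirtualService or a weighted load-balancer target group.
    print(f"canary={canary_pct}% stable={100 - canary_pct}%")

def canary_rollout(steps, is_healthy) -> int:
    """Shift traffic to the canary step by step, gated on health checks.
    Returns the final canary weight: the last step means promoted,
    0 means rolled back to the stable version."""
    for weight in steps:                 # e.g. 10 -> 25 -> 50 -> 100 (%)
        route_traffic(weight)
        if not is_healthy():             # live error-rate / latency gate
            route_traffic(0)             # automatic rollback
            return 0
    return steps[-1]

assert canary_rollout([10, 25, 50, 100], lambda: True) == 100   # promoted
assert canary_rollout([10, 25, 50, 100], lambda: False) == 0    # rolled back
```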

8. How would you ensure observability in a complex system with multiple microservices?
 Implement distributed tracing using tools like Jaeger or OpenTelemetry to
track requests across services. Set up centralized logging with the ELK stack
or Fluentd.
 Implement metrics monitoring with Prometheus and visualize it using
Grafana dashboards. Use correlation IDs to track a single request across
multiple services for easier debugging.
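A minimal sketch of correlation-ID propagation, using the widely used (but conventional, not standardized) `X-Correlation-ID` header; the service names are illustrative:

```python
import uuid

def log(cid: str, msg: str) -> None:
    print(f"[cid={cid}] {msg}")      # grep one ID -> the whole request path

def handle_request(headers: dict) -> dict:
    """Reuse the caller's correlation ID, or mint one at the edge,
    then attach it to every log line and downstream call."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    log(cid, "checkout-service: received request")
    log(cid, "checkout-service: calling payment-service")
    return {**headers, "X-Correlation-ID": cid}   # propagated downstream

out = handle_request({"X-Correlation-ID": "abc-123"})
assert out["X-Correlation-ID"] == "abc-123"       # caller's ID preserved
```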

9. Explain how you would secure container images and the registry.

 Use tools like Clair or Trivy for scanning container images for vulnerabilities.
 Sign images with Docker Content Trust or Notary. Implement role-based
access control (RBAC) in the registry to limit who can push/pull images.
 Enforce TLS for registry communication and use private registries like Harbor
for secure storage.

10. What is your approach to managing secrets in a distributed environment?

 Use secret management tools like HashiCorp Vault, AWS Secrets Manager, or
Azure Key Vault.
 Ensure secrets are not hardcoded and are injected into applications at
runtime using environment variables or mounted files. Rotate secrets
regularly and apply auditing to ensure no unauthorized access.
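Runtime injection can be sketched like this; the `DB_PASSWORD` name is illustrative, and in production the value would be set by the orchestrator or a Vault agent, never by the application itself:

```python
import os

def get_db_password() -> str:
    """Read the secret injected at runtime; never hardcode it.
    In Kubernetes the same value would arrive via a Secret mounted
    as an env var or a file; os.environ stands in for either here."""
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD not injected - refusing to start")
    return password

os.environ["DB_PASSWORD"] = "s3cr3t"   # done by the platform, not the code
assert get_db_password() == "s3cr3t"
```

Failing fast when the secret is absent keeps a misconfigured deployment from starting with empty credentials.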

SET – 2
1. What are some strategies for optimizing cost in cloud-based DevOps pipelines?
Use spot instances or reserved instances for non-production workloads. Right-size
VMs and containers based on usage patterns. Implement auto-scaling to match
capacity with demand. Use tools like AWS Cost Explorer or Google Cloud Pricing
Calculator to monitor and optimize cloud spend.
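Right-sizing from usage patterns can be sketched as picking a high percentile of observed consumption plus headroom; the samples and the 20% headroom below are illustrative:

```python
def right_size(cpu_samples: list, headroom: float = 1.2) -> int:
    """Suggest a CPU request from observed usage: take roughly the
    95th percentile of real samples and add headroom, instead of
    paying for a worst-case flat allocation."""
    ordered = sorted(cpu_samples)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return round(p95 * headroom)

# Millicores sampled over a day on a pod provisioned flat at 2000m
samples = [120, 150, 180, 200, 210, 230, 250, 300, 340, 400]
print(right_size(samples))   # well under the flat 2000m allocation
```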

2. What are the key differences between event-driven architecture and traditional
request-response architecture in a microservices setup?

 Event-driven architecture: Services communicate via asynchronous events,
allowing decoupled and highly scalable systems. Examples include Kafka and
RabbitMQ.
 Request-response architecture: Services directly communicate
synchronously, which can lead to tight coupling and higher latency but is
easier to debug.
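The decoupling can be illustrated with a toy in-memory event bus; a real broker like Kafka adds persistence, partitioning, and delivery guarantees this sketch omits:

```python
from collections import defaultdict

class EventBus:
    """Minimal stand-in for a broker like Kafka or RabbitMQ:
    publishers and subscribers never reference each other directly."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:   # fan-out, no reply expected
            handler(event)

bus = EventBus()
received = []
bus.subscribe("order.created", received.append)                     # inventory
bus.subscribe("order.created", lambda e: received.append(e["id"]))  # billing
bus.publish("order.created", {"id": 42})
assert received == [{"id": 42}, 42]   # both consumers saw the event
```

Note the publisher neither knows nor waits for its consumers, which is exactly what makes the style scalable and what makes it harder to debug than request-response.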

3. How do you handle scaling of stateful applications in Kubernetes?
Use StatefulSets for stateful applications that require unique network IDs and
persistent storage. Implement volume replication and multi-zone Persistent Volumes.
Utilize Kubernetes storage classes with cloud provider-backed storage (e.g., AWS
EBS, GCP Persistent Disks).

4. How would you implement a GitOps workflow for infrastructure management?
Use Git as the single source of truth for both application code and infrastructure
code (IaC). Implement tools like ArgoCD or Flux to automatically deploy changes
from the Git repository to the Kubernetes cluster. Ensure changes are reviewed and
approved via pull requests before they are merged and deployed.

5. How would you design a multi-tenant Kubernetes environment?
Use namespaces to isolate workloads for different tenants. Implement network
policies to restrict communication between tenant namespaces. Use RBAC to ensure
only authorized users can manage resources within their own namespaces. Set up
resource quotas to limit the amount of CPU, memory, and storage available to each
tenant.
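The quota check behaves like this sketch of a namespace admission decision (the resource names and limits are illustrative):

```python
def admit(pod_request: dict, used: dict, quota: dict) -> bool:
    """Admission check in the spirit of a namespace ResourceQuota:
    reject the pod if it would push any tracked resource over its limit."""
    return all(
        used.get(res, 0) + pod_request.get(res, 0) <= limit
        for res, limit in quota.items()
    )

quota = {"cpu_m": 2000, "memory_mi": 4096}     # per-tenant namespace limits
used = {"cpu_m": 1800, "memory_mi": 1024}      # what the tenant already runs
assert admit({"cpu_m": 100, "memory_mi": 512}, used, quota) is True
assert admit({"cpu_m": 500, "memory_mi": 512}, used, quota) is False  # over CPU
```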

6. What strategies would you use to monitor and debug networking issues in a Kubernetes
cluster?
Use Kubernetes network policies to enforce rules on pod communication and isolate
network traffic. Implement CNI plugins like Calico or Weave for managing pod
network traffic. Debug using tcpdump, kubectl exec to ping pods, and network
visualization tools like Kiali for tracing service mesh traffic.

7. How do you ensure observability in a serverless architecture?
Implement distributed tracing with AWS X-Ray or Google Cloud Trace for serverless
functions. Use centralized logging systems like CloudWatch or Stackdriver Logging.
Monitor function performance and trigger rates with metrics using Prometheus,
Datadog, or cloud-native monitoring services.

8. What’s your approach to handling multi-cloud DevOps environments?
Use tools like Terraform or Pulumi to manage infrastructure across different cloud
providers. Implement cloud-agnostic CI/CD pipelines with tools like Spinnaker or
Jenkins. Ensure centralized monitoring and logging across clouds using Grafana or
Prometheus with multi-cloud support.

9. How would you implement a secure and highly available container registry in a cloud
environment?
Use cloud-native container registries like Amazon ECR, Azure Container Registry, or
Google Container Registry with private repositories. Enable encryption at rest and in
transit using SSL/TLS. Implement role-based access control (RBAC) for fine-grained
permissions. Replicate registries across multiple regions for high availability and use
geo-replication to reduce latency.

10. How do you handle monitoring and alerting in a multi-cloud environment?
Use centralized monitoring tools like Prometheus with federation or Datadog to
aggregate metrics across multiple clouds. Set up cloud-native monitoring tools like
AWS CloudWatch, Google Cloud Monitoring, and integrate them into one dashboard
(e.g., Grafana).
Implement cross-cloud alerting with unified systems like PagerDuty or Opsgenie to
trigger alerts based on consolidated metrics. Use tags to correlate cloud-specific
resources for granular monitoring and create cloud-agnostic views.

SET – 3
1. How would you design a global content delivery network (CDN) for a large-scale
application?
Leverage CDNs like AWS CloudFront, Azure CDN, or Google Cloud CDN to serve static
assets globally. Implement origin failover and health checks to ensure high
availability in case of CDN outages. Optimize caching policies (TTL) for frequently
requested resources and implement cache invalidation when updates are required.
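Caching with a TTL plus explicit invalidation can be sketched as follows; `fetch_origin` is a hypothetical origin fetch, and a real CDN adds per-edge caches and purge APIs on top of this idea:

```python
import time

class TTLCache:
    """Toy edge cache: entries expire after `ttl` seconds, and
    invalidate() mimics a CDN purge when origin content changes."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.store = {}                          # path -> (body, expires_at)

    def get(self, path, fetch_origin):
        entry = self.store.get(path)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        body = fetch_origin(path)                # miss or stale: go to origin
        self.store[path] = (body, time.monotonic() + self.ttl)
        return body

    def invalidate(self, path):
        self.store.pop(path, None)               # explicit purge

calls = []
origin = lambda p: calls.append(p) or f"v{len(calls)}:{p}"   # counts origin hits
cache = TTLCache(ttl=60)
assert cache.get("/logo.png", origin) == "v1:/logo.png"      # origin hit
assert cache.get("/logo.png", origin) == "v1:/logo.png"      # served from cache
cache.invalidate("/logo.png")                                # content updated
assert cache.get("/logo.png", origin) == "v2:/logo.png"      # fresh from origin
```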

2. What is your approach to managing infrastructure across multiple cloud providers using
Infrastructure as Code (IaC)?
Implement a modular architecture in Terraform to manage reusable and versioned
infrastructure components. Use state backends like S3 with DynamoDB locks for
state management in Terraform, ensuring consistency across teams. Apply GitOps
principles to manage infrastructure changes via pull requests, versioning, and
continuous integration.

3. What is your strategy for database schema management in a CI/CD pipeline for
distributed microservices?
Automate database migrations as part of the CI/CD pipeline, ensuring that
migrations are applied alongside code deployments. Implement blue/green or
canary deployments for schema changes to minimize downtime and avoid breaking
changes.
Use backward-compatible database schemas (e.g., adding nullable columns, avoiding
drop statements in production) to prevent breaking active connections.
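A backward-compatibility gate can be sketched as a simple migration linter run in CI; the rules below are illustrative, not an exhaustive safety check:

```python
import re

UNSAFE = [
    (r"\bDROP\s+(TABLE|COLUMN)\b", "drop breaks readers still using it"),
    (r"\bADD\s+COLUMN\b(?!.*\b(NULL|DEFAULT)\b)",
     "new column must be nullable or have a default"),
    (r"\bRENAME\b", "rename breaks old code; add new, backfill, then remove"),
]

def check_migration(sql: str) -> list:
    """Flag statements that are not backward compatible with the
    previous application version still running during the rollout."""
    return [msg for pattern, msg in UNSAFE
            if re.search(pattern, sql, re.IGNORECASE)]

assert check_migration("ALTER TABLE users ADD COLUMN age INT NULL") == []
assert check_migration("ALTER TABLE users DROP COLUMN age") == \
    ["drop breaks readers still using it"]
```

Blocking these patterns in the pipeline forces the expand-then-contract sequence the answer describes.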

4. How do you implement and maintain service level agreements (SLAs) and service level
objectives (SLOs) in a microservices architecture?
Automate error budget tracking by monitoring the difference between the current
performance and SLOs, and use this to throttle deployments. Set up SLIs (Service
Level Indicators) for key metrics and track them using monitoring systems like
Grafana, Datadog, or New Relic. Implement automated incident management when
SLAs or SLOs are violated, ensuring that alerts trigger corrective actions.
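Error-budget tracking reduces to simple arithmetic; a sketch, with deployments frozen once the budget goes negative:

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the current window.
    With a 99.9% availability SLO the budget is the 0.1% of requests
    allowed to fail; a result <= 0 means deployments should be frozen."""
    allowed_failures = total * (1 - slo)
    return (allowed_failures - failed) / allowed_failures

# 99.9% SLO over 1,000,000 requests -> budget of ~1,000 failed requests
print(error_budget_remaining(0.999, 1_000_000, 250))    # most budget left
print(error_budget_remaining(0.999, 1_000_000, 1_200))  # negative: freeze
```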

5. How do you design an architecture for deploying applications across multiple
Kubernetes clusters?
Implement service mesh (e.g., Istio) to handle cross-cluster communication and
traffic routing. Deploy cluster-local services for intra-cluster communication and use
global ingress controllers for inter-cluster load balancing.

Ensure consistent configuration management across clusters using Helm or
Kustomize with a GitOps workflow.

6. What is your approach to managing secrets securely in Kubernetes, especially in a
multi-cloud environment?
Implement Kubernetes Secrets for sensitive data like API keys, but encrypt secrets at
rest using Kubernetes EncryptionConfiguration. Use tools like Sealed Secrets to
encrypt Kubernetes secrets and store them safely in version control.

7. How would you design a deployment strategy for a hybrid cloud architecture with both
on-prem and cloud environments?

Implement a VPN or Direct Connect between on-prem and cloud for secure,
low-latency communication. Use a service mesh (e.g., Linkerd or Istio) to handle
network traffic between on-prem and cloud services, ensuring failover mechanisms.
Automate deployment and scaling using IaC tools like Terraform and CI/CD tools
that work across both environments (e.g., Jenkins, GitLab CI).

8. What strategies would you use to ensure compliance with industry standards (e.g.,
HIPAA, GDPR) in a DevOps environment?
Implement policy as code with tools like OPA (Open Policy Agent) to enforce
compliance across infrastructure and application configurations. Use automated
security scanning tools (e.g., Anchore, Aqua Security) as part of the CI/CD pipeline to
ensure containers are compliant. Encrypt sensitive data both at rest and in transit,
using tools like HashiCorp Vault for secrets management.

9. How would you design a multi-tenant SaaS application with isolated data and
environments per tenant?
Implement tenant-specific databases or schema-based isolation to ensure data
separation, with strict access controls. Use network policies and service meshes (e.g.,
Istio) to segregate tenant network traffic and ensure secure communication
between services. Implement resource quotas to limit CPU, memory, and storage
per tenant, preventing one tenant from exhausting shared resources.

10. How do you manage incident response in a complex microservices architecture with
multiple teams?
Implement centralized logging and monitoring systems (e.g., ELK stack, Prometheus,
Datadog) to provide visibility into incidents across all services. Use incident
management tools like PagerDuty or Opsgenie to automate alerting and assign
incidents to relevant teams based on predefined rules. Establish a runbook with
well-documented incident response procedures for each service, ensuring on-call
teams can respond quickly.

SET – 4
1. How would you approach autoscaling for a machine learning (ML) pipeline in
Kubernetes?
Use Horizontal Pod Autoscaler (HPA) to scale the number of pods based on resource
usage (e.g., CPU, memory) or custom metrics (e.g., inference requests). Implement
GPU-based autoscaling with support for specialized hardware in Kubernetes for ML
workloads. Use batch processing for training jobs with tools like Kubeflow or Airflow
to manage large datasets and scale clusters dynamically based on job requirements.
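The scaling decision itself follows the HPA formula from the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric), which works the same for CPU and for custom metrics like inference requests per second:

```python
from math import ceil

def desired_replicas(current: int, current_metric: float,
                     target_metric: float,
                     min_r: int = 1, max_r: int = 50) -> int:
    """Core HPA scaling rule: desired = ceil(current * metric / target),
    clamped to the configured min/max replica bounds."""
    desired = ceil(current * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# 4 pods at 90% CPU with a 60% target -> scale out to 6
assert desired_replicas(4, 90, 60) == 6
# 4 pods serving 20 inference req/s each against a 40 req/s target -> 2 pods
assert desired_replicas(4, 20, 40) == 2
```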

2. How would you implement an immutable infrastructure strategy in a large-scale
system?
Implement automated image-building pipelines (e.g., with Packer or Docker) to
generate images for each deployment, ensuring that no changes are applied directly
to live systems. Leverage blue/green deployments or canary releases to switch
between versions with zero downtime, ensuring infrastructure remains immutable.
Use tools like Terraform or AWS CloudFormation to automate the creation of new
infrastructure with each change, avoiding manual intervention.
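The cutover mechanics can be sketched as follows; the version strings are illustrative, and a real setup flips DNS or load-balancer targets rather than a field:

```python
class BlueGreen:
    """Two identical environments; the router flips between them
    atomically, so rollback is just flipping back."""

    def __init__(self):
        self.envs = {"blue": None, "green": None}
        self.live = "blue"

    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> str:
        target = self.idle()
        self.envs[target] = version    # build fresh, never patch live
        self.live = target             # cut over after smoke tests pass
        return self.live

bg = BlueGreen()
bg.envs["blue"] = "v1"
assert bg.deploy("v2") == "green"                  # v2 went to the idle env
assert bg.envs == {"blue": "v1", "green": "v2"}    # v1 kept for rollback
assert bg.deploy("v3") == "blue"                   # next release reuses blue
```

Because the previous environment is left untouched until the next release, rollback never requires a rebuild.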

3. How would you manage and monitor thousands of containers running across multiple
Kubernetes clusters?
Implement a service mesh such as Istio or Linkerd for observability, service
discovery, and traffic management. Use ELK Stack or Fluentd for centralized logging,
collecting logs from multiple clusters and aggregating them in one place.

4. How do you approach root cause analysis (RCA) in a large, distributed system where
multiple services are integrated?
Use distributed tracing tools like Jaeger or OpenTelemetry to follow the life cycle of a
request across microservices and identify bottlenecks. Centralize logs from multiple
services using the ELK stack or Splunk, and set up structured logging to track specific
events or errors.
Use correlation IDs across services to trace the flow of a single transaction through
various components. Implement automated alerting systems based on key
performance indicators (KPIs) like error rates, latency, and CPU/memory usage.

5. How would you design a secure CI/CD pipeline that integrates security checks at each
stage?
Implement pre-commit hooks to run static code analysis and code linting for security
issues. Add SAST tools like SonarQube, Checkmarx, or Bandit in the build stage.
Include dependency scanning tools like Snyk, OWASP Dependency-Check, or
WhiteSource to detect vulnerable libraries.

Implement DAST (Dynamic Application Security Testing) tools to test the running
application for vulnerabilities in real-time. Use role-based access control (RBAC) in
your CI/CD tools to enforce the principle of least privilege, ensuring only authorized
personnel can deploy.

6. What would be your approach to implementing chaos engineering in a production
environment?
Use chaos engineering tools like Gremlin or Chaos Monkey to simulate failures such
as node failures, network outages, or pod restarts. Monitor the system’s resilience
and response to failures using real-time monitoring tools like Datadog, Prometheus,
and ELK. Analyze the results, identify single points of failure, and implement
necessary changes to increase system resilience.

7. How do you ensure that containerized applications follow best practices for
performance and security?
Implement multi-stage builds to optimize image size by separating the build and
runtime environments. Regularly scan containers using security tools like Clair, Aqua,
or Trivy to detect vulnerabilities in base images and dependencies. Continuously
monitor container performance using tools like cAdvisor, Prometheus, and
Kubernetes Horizontal Pod Autoscaler for dynamic resource allocation.

8. How can you copy Jenkins from one server to another?
Move a job from one Jenkins installation to another by copying the corresponding
job directory under $JENKINS_HOME/jobs. Create a copy of an existing job by cloning
its job directory under a different name, and rename a job by renaming its directory.
To migrate an entire Jenkins instance, copy the whole JENKINS_HOME directory to
the new server.

9. How can you temporarily turn off Jenkins security if the administrative users have
locked themselves out of the admin console?
When security is enabled, the main configuration file ($JENKINS_HOME/config.xml)
contains an XML element named useSecurity that will be set to true. By changing this
setting to false, security will be disabled the next time Jenkins is restarted, allowing
the locked-out administrator to regain access.

10. Can Selenium test an application on an Android browser?
Selenium is capable of testing an application on an Android browser using an
Android driver. You can use the Selendroid or Appium framework to test native apps
or web apps in the Android browser.

SET – 5
1. Explain the architecture of Docker.
Docker uses a client-server architecture.
The Docker client is the interface that accepts commands; each command is
translated using the REST API and sent to the Docker daemon (the server).
The Docker daemon accepts the request and interacts with the operating system to
build Docker images and run Docker containers.
A Docker image is a template of instructions, which is used to create containers.
A Docker container is an executable package of an application and its dependencies
together.
A Docker registry is a service to host and distribute Docker images among users.

2. How do we share Docker containers with different nodes?
It is possible to share Docker containers on different nodes with Docker Swarm.
Docker Swarm is a tool that allows IT administrators and developers to create and
manage a cluster of swarm nodes within the Docker platform. A swarm consists of
two types of nodes: a manager node and a worker node.

3. How does Nagios help in the continuous monitoring of systems, applications, and
services?
Nagios enables continuous server monitoring and the ability to check whether
servers are sufficiently utilized and whether any task failures need to be addressed.
It verifies the status of servers and services, inspects the health of your
infrastructure, and checks that applications are working correctly and web servers
are reachable.

4. Explain what state stalking is in Nagios.
State stalking is used for logging purposes in Nagios. When stalking is enabled for a
particular host or service, Nagios will watch that host or service very carefully. It will
log any changes it sees in the output of check results. This helps in the analysis of log
files.

5. Which open-source or community tools do you use to make Puppet more powerful?
Changes in the configuration are tracked using Jira, and further maintenance is done
through internal procedures. Version control takes the support of Git and Puppet’s
Code Manager app. The changes are also passed through Jenkins’ continuous
integration pipeline.

6. What is virtualization, and how does it connect to DevOps?
Virtualization is creating a virtual version of something, such as a server, storage
device, or network.
virtual environments that can be used for development, testing, and deployment.
This can help improve efficiency, reduce costs, and enable greater flexibility and
scalability.

7. What is the best way to make content reusable/redistributable?
Roles are used to manage tasks in a playbook, and they can be easily shared via
Ansible Galaxy. “include” is used to add a submodule or another file to a playbook,
so code written once can be added to multiple playbooks. “import” is an
improvement over “include” that ensures a file is added only once, which is helpful
when a line is run recursively.

8. Explain how you can set up a Jenkins job.
To create a Jenkins job, go to the Jenkins top page, choose the New Job option, and
select Build a free-style software project. A free-style job then combines: an optional
source code management system (SCM), like Subversion or CVS; optional triggers for
controlling when Jenkins builds; a build script (Ant, Maven, shell script, batch file,
etc.) that actually does the work; and optional steps for gathering data from the
build, like collecting Javadoc, publishing test results, and/or archiving artifacts.

9. Explain how you manage secrets and sensitive data in a DevOps environment.
To manage secrets, I use HashiCorp Vault, which provides secure secret storage and
tight access control. I’ve set up policies that grant access to secrets based on roles,
ensuring that only the necessary services and team members have access.
Additionally, all sensitive data is encrypted in transit and at rest, and we audit access
logs regularly to maintain compliance with security standards.

10. How do you foster a DevOps culture in a team or organization?
To foster a DevOps culture, I emphasize the importance of collaboration and shared
responsibility. In my last role, I organized cross-functional workshops and regular
‘blameless’ retrospectives to encourage open communication and collective
problem-solving. I also set up internal platforms for knowledge sharing and
championed a ‘you build it, you run it’ philosophy, which empowered developers to
take ownership of their code from development to production. This not only
improved our deployment frequency but also increased team morale and
productivity.